Merging Datasets of Disparate Size: A Machine Learning Approach

James C. Davis, Holden A. Diethorn, Gerald R. Marschke, Andrew J. Wang

Jan 1, 2019

Abstract

Many papers in applied microeconomics seek answers to questions that cannot be answered using a single dataset. Conventional practice is to merge different data sources and keep only those observations that match across all the data sources necessary for the analysis. When the datasets being merged have the same unit of observation, this practice implicitly restricts the sample size to no more than the number of observations in the smallest of these datasets. With the increasing prevalence of big data, researchers are more likely to merge datasets of vastly disparate size. We provide an alternative to the conventional method of merging such datasets that enables researchers to obtain a sample whose size equals the number of observations in the largest of the datasets. Our method relies on machine learning techniques that effectively impute values of key variables from the smaller datasets onto the larger datasets. We show how to choose among competing machine learning models and how to properly assess the accuracy of their predictions (imputations), and we provide a concrete example of this process by constructing a linked employer-employee dataset of the PhD workforce.
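
To make the contrast concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It uses synthetic data and only the Python standard library. All names (`large`, `small`, `fit_mean`, `fit_linear`), the data-generating process, and the two candidate models are assumptions chosen for illustration. The conventional merge keeps only matched observations, while the imputation route fits competing models on the matched subset, compares them on held-out matched observations, and then predicts the key variable for every record in the larger dataset.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical large dataset: id -> covariate x, observed for every unit.
large = {i: random.uniform(0, 10) for i in range(1000)}

# Hypothetical small dataset: id -> key variable y, observed for a subset only.
small_ids = random.sample(sorted(large), 100)
small = {i: 2.0 * large[i] + 1.0 + random.gauss(0, 0.5) for i in small_ids}

# Conventional merge: keep only units present in both datasets.
matched = [(large[i], small[i]) for i in small_ids]

# Split the matched units so accuracy is assessed on data not used for fitting.
train, holdout = matched[:70], matched[70:]

def fit_mean(data):
    """Baseline candidate: always predict the training mean of y."""
    m = mean(y for _, y in data)
    return lambda x: m

def fit_linear(data):
    """Second candidate: simple least-squares line y = a + b * x."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    xbar, ybar = mean(xs), mean(ys)
    slope = (sum((x - xbar) * (y - ybar) for x, y in data)
             / sum((x - xbar) ** 2 for x in xs))
    intercept = ybar - slope * xbar
    return lambda x: intercept + slope * x

def mse(model, data):
    """Mean squared prediction error of a fitted model on (x, y) pairs."""
    return mean((model(x) - y) ** 2 for x, y in data)

# Choose among competing models by held-out error.
candidates = {"mean": fit_mean, "linear": fit_linear}
scores = {name: mse(fit(train), holdout) for name, fit in candidates.items()}
best_name = min(scores, key=scores.get)

# Refit the chosen model on all matched units, then impute y wherever
# it is missing, keeping the observed value where one exists.
best_model = candidates[best_name](matched)
imputed = {i: (small[i] if i in small else best_model(x))
           for i, x in large.items()}

print(len(matched), len(imputed), best_name)
```

In this toy setting the merged sample is limited to the 100 matched units, whereas the imputed dataset covers all 1000 units of the larger dataset. The held-out comparison stands in for whatever accuracy assessment a real application would require.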