Statistical Methods for Integrative Inference with Imperfect and Heterogeneous Data Sources

Abstract

Probability samples serve as a foundation for unbiased statistical inference but are often costly and prone to incomplete information, whereas non-probability samples are efficient and convenient but lack a unified framework for valid inference. This thesis develops a series of statistical methodologies for estimating the population mean of a response variable by integrating probability and non-probability samples across a range of imperfect-data settings commonly encountered in modern observational studies and survey research. Chapter 2 introduces a flexible likelihood-based framework for integrating a probability sample with fully observed covariates and a non-probability sample with a misclassified response measured through multiple surrogates. The proposed inference procedure enables consistent estimation of the population mean and improves efficiency by leveraging auxiliary information from the probability sample. Chapter 3 adapts the framework to settings where a common covariate is misclassified in both the probability and non-probability samples and measured through multiple surrogates. A likelihood-based estimator and a doubly robust (DR) extension are developed to integrate a probability sample with a surrogate-measured covariate and no response and a non-probability sample with the same surrogate-measured covariate and a fully observed response, with a shared set of fully observed auxiliary covariates in both samples, yielding consistent population mean estimation under partial model misspecification. Chapter 4 further considers joint misclassification of both a covariate and the response. A joint likelihood-based procedure is proposed to simultaneously recover the latent covariate and response by integrating a probability sample in which the covariate is misclassified through multiple surrogates, other covariates are fully and precisely observed, and the response is unavailable, with a non-probability sample that contains the same surrogate-measured covariate, the same set of fully observed auxiliary covariates, and a response that is itself misclassified via multiple surrogates. The proposed integration strategy corrects for selection bias and achieves consistent population mean estimation under correct outcome model specification, with substantial gains in bias reduction and efficiency. Chapter 5 extends the framework to survival analysis by integrating a probability sample with complete covariates and no response and a non-probability sample with a right-censored time-to-event outcome. Parametric regression-based, inverse probability weighting (IPW), and DR estimators are developed to estimate the population mean event time, with the DR estimator exhibiting strong robustness in simulation studies. Collectively, this thesis provides a unified likelihood-based framework for integrating probability and non-probability samples under various forms of data imperfection, advances population-level statistical inference in the presence of misclassification, missingness, and biased sampling mechanisms, and identifies promising directions for future methodological development in more complex data settings.
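
For readers unfamiliar with how a doubly robust estimator combines the two samples, the following is a minimal sketch of a generic DR form for a population mean. It is not the thesis's estimator: it ignores misclassification and censoring, uses a plain linear outcome model, and assumes the participation propensities for the non-probability sample have already been estimated. The function name `dr_mean` and all argument names are hypothetical, chosen only for illustration.

```python
import numpy as np

def dr_mean(x_np, y_np, pi_np, x_p, d_p):
    """Generic doubly robust estimate of a population mean (illustrative only).

    x_np, y_np : covariates and responses from the non-probability sample
    pi_np      : participation propensities for those units (assumed given)
    x_p        : covariates from the probability sample (no response needed)
    d_p        : design weights of the probability sample
    """
    # Design matrices with an intercept column
    X_np = np.column_stack([np.ones(len(x_np)), x_np])
    X_p = np.column_stack([np.ones(len(x_p)), x_p])

    # Outcome model: ordinary least squares fitted on the non-probability sample
    beta, *_ = np.linalg.lstsq(X_np, y_np, rcond=None)
    m_np = X_np @ beta  # fitted values in the non-probability sample
    m_p = X_p @ beta    # predicted values in the probability sample

    # Prediction term: design-weighted mean of predictions
    prediction = np.sum(d_p * m_p) / np.sum(d_p)

    # Correction term: inverse-propensity-weighted mean of residuals
    w_np = 1.0 / np.asarray(pi_np, dtype=float)
    correction = np.sum(w_np * (y_np - m_np)) / np.sum(w_np)

    return prediction + correction

# Toy usage with made-up numbers (not data from the thesis)
x_np = np.array([1.0, 2.0, 3.0])
y_np = 1.0 + 2.0 * x_np
pi_np = np.array([0.2, 0.5, 0.9])
x_p = np.array([0.0, 1.0, 2.0, 3.0])
d_p = np.full(4, 25.0)
est = dr_mean(x_np, y_np, pi_np, x_p, d_p)
```

The estimate is the sum of a prediction term, which relies on the outcome model, and a residual correction term, which relies on the propensities. It remains consistent if either of the two models is correctly specified, which is the sense in which such estimators are called doubly robust.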

Citation

Yu, Z. (2026). Statistical methods for integrative inference with imperfect and heterogeneous data sources (Doctoral thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.
