Private Evaporative Cooling
WORK IN PROGRESS (WIP)
Written by Trang Le – 2016- in collaboration with Bill White
University of Tulsa – insilico.utulsa.edu
Privacy Preserving Relief-F and Random Forest with Evaporative Cooling Feature Selection
Classification of individuals into disease categories from biological data is an important application of statistical learning in biomedical research. Feature selection methods are incorporated into classifier training to avoid false positive predictions caused by overfitting in very high-dimensional biological data. Recently, in domains outside of bioinformatics with large sample sizes, methods based on differential privacy have been proposed to avoid overfitting, such as a differentially private random forest algorithm (Singh, ICACCI, 2014) and a framework for reusable holdout sets (Dwork, Science, 2015). We introduce a stochastic privacy-preserving feature selection algorithm that optimizes random forest classification, based on an analogy with the evaporative cooling of an atomic gas. Evaporation represents the backward elimination of features, and temperature represents the privacy threshold. Using simulated data with embedded differential co-expression networks, we compare the test accuracy of evaporative cooling selection with that of thresholdout, standard random forest, and thresholdout RF.
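
The evaporation analogy can be illustrated with a minimal Python sketch. It is not the project's implementation: the text does not state the removal rule, so the Boltzmann-style removal probability exp(-importance / temperature) used here is an assumption, and the function name `evaporate`, its parameters, and the feature names are hypothetical. Importance scores are passed in as fixed numbers; a fuller version would presumably recompute them (for example from Relief-F or a random forest) after each removal, and would tie the temperature to a privacy threshold, neither of which is modeled here.

```python
import math
import random


def evaporate(importance, n_keep, temperature, rng=None):
    """Stochastic backward elimination of features (illustrative sketch only).

    importance  -- dict mapping feature name to an importance score
    n_keep      -- number of features to retain
    temperature -- positive number; higher values make removal more random
    rng         -- optional random.Random instance for reproducibility
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    remaining = dict(importance)

    while len(remaining) > n_keep:
        names = list(remaining)
        lowest = min(remaining.values())
        # ASSUMPTION: removal weight follows a Boltzmann-like factor, so
        # low-importance features are the most likely to "evaporate".
        # Subtracting the lowest score keeps the largest weight at 1.0
        # and avoids all weights underflowing to zero.
        weights = [
            math.exp(-(remaining[name] - lowest) / temperature)
            for name in names
        ]
        victim = rng.choices(names, weights=weights, k=1)[0]
        del remaining[victim]

    return sorted(remaining)


# Hypothetical importance scores for four features.
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.1, "g4": 0.05}
cold = evaporate(scores, n_keep=2, temperature=1e-6, rng=random.Random(0))
hot = evaporate(scores, n_keep=2, temperature=10.0, rng=random.Random(0))
print(cold, hot)
```

At a very low temperature the sketch behaves like ordinary greedy backward elimination and always drops the least important feature. At a high temperature the removal is close to uniform, so important features can be lost. This trade-off is the one the temperature parameter is meant to control in the analogy.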