Last Updated: 11/22/16

## Sample Size Planning for Developing Classifiers Using High Dimensional Data

##### Kevin Dobbin and Richard Simon, Biostatistics 8:101-17, 2007.

Kevin Dobbin, Yingdong Zhao and Richard Simon, Clinical Cancer Research 14:108-114, 2008.

The program only considers sample sizes below 300. If more than 300 samples are needed, an error message is returned indicating this.

Note: This program estimates the training set sample size required to ensure that the resulting binary classifier has an expected accuracy within a tolerance of the optimal accuracy. Classifier performance should also be assessed, either by cross-validation (resampling) or by applying the classifier to an independent validation set. The sample sizes given by this program are not the sample sizes required for a validation set, and they do not address the precision of a cross-validated estimate of prediction accuracy.

Definition of standardized fold change: The standardized fold change is the difference between the class means divided by the within-class standard deviation, both on the base 2 log scale. For example, if the raw fold change between the classes is 2, so that log2(2) = 1, and the within-class standard deviation for a typical gene is 0.71, then the standardized fold change is 1/0.71 ≈ 1.4. The value 0.71 is a typical median within-class standard deviation observed on human tumors in microarray experiments. The value to enter into the program is an estimate of the absolute value of this ratio for the gene with the largest standardized fold change, multiplied by a shrinkage factor; we recommend a shrinkage factor of 0.80.
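The arithmetic in the definition above can be sketched in a few lines of Python. This is only an illustration of the calculation described in the text, not code from the program itself; the function name `standardized_fold_change` and its parameter names are hypothetical.

```python
import math


def standardized_fold_change(raw_fold_change, within_class_sd, shrinkage=0.80):
    """Return the shrunken standardized fold change.

    raw_fold_change: ratio of class means on the raw (unlogged) scale.
    within_class_sd: within-class standard deviation on the log2 scale.
    shrinkage: multiplier applied to the ratio (0.80 is the recommended value).
    """
    # Difference between class means on the base 2 log scale, as an absolute value
    log2_difference = abs(math.log2(raw_fold_change))
    return shrinkage * log2_difference / within_class_sd


# Worked example from the text: raw fold change 2, standard deviation 0.71
unshrunk = standardized_fold_change(2.0, 0.71, shrinkage=1.0)
shrunk = standardized_fold_change(2.0, 0.71)
print(round(unshrunk, 1))  # 1.4, the ratio before shrinkage
print(round(shrunk, 2))    # 1.13, the value after applying the 0.80 shrinkage factor
```

With the example numbers, the shrunken value is 0.80 × 1/0.71, or roughly 1.13.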

Definition of prevalence: It is assumed that the prevalence in the sample is equal to the prevalence in the population, which should be roughly true under random sampling with reasonably large sample sizes. If, on the other hand, the sample is drawn in such a way that each class is equally represented, then the prevalence should be set to 50%.

Definition of tolerance: The sample size n is chosen to ensure that the expected (average) accuracy of the resulting classifier is within the tolerance of the accuracy of the best possible classifier. For example, a tolerance of 0.10 yields a sample size for which the expected accuracy is within 0.10 of the best accuracy possible for the population. In other words, if samples of size n were repeatedly drawn from the population, the average accuracy of the resulting classifiers would be within 0.10 of the best possible.
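The repeated-sampling interpretation of tolerance can be illustrated with a small simulation. The sketch below is a deliberately simplified toy, not the high-dimensional method the program uses. It assumes a single normally distributed feature with unit within-class standard deviation, equal prevalence, and a midpoint-threshold classifier, and all function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def accuracy_of_trained_rule(delta, n_per_class, rng):
    """Train a midpoint-threshold rule on one simulated training set and
    return its exact population accuracy (equal prevalence assumed)."""
    # Class 0 is N(-delta/2, 1) and class 1 is N(+delta/2, 1)
    m0 = fmean(rng.gauss(-delta / 2, 1.0) for _ in range(n_per_class))
    m1 = fmean(rng.gauss(delta / 2, 1.0) for _ in range(n_per_class))
    cut = (m0 + m1) / 2
    # Accuracy of the rule "predict class 1 when x > cut"
    acc = 0.5 * PHI(cut + delta / 2) + 0.5 * (1 - PHI(cut - delta / 2))
    # If the sample means come out in the wrong order, the rule is reversed
    return acc if m1 > m0 else 1 - acc


def expected_gap(delta, n_per_class, reps=2000, seed=0):
    """Best possible accuracy minus the average accuracy over repeated samples."""
    rng = random.Random(seed)
    best = PHI(delta / 2)  # accuracy of the optimal cut at 0
    average = fmean(
        accuracy_of_trained_rule(delta, n_per_class, rng) for _ in range(reps)
    )
    return best - average


for n in (2, 10, 50):
    print(n, round(expected_gap(1.4, n), 4))
```

In this toy setting the gap shrinks as the number of training samples per class grows. A tolerance of 0.10 corresponds to choosing the smallest sample size whose gap falls below 0.10.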

How to use this program with survival data? This program can be used heuristically for developing classifiers that predict risk groups from survival data. Conceptually, patients with survival time greater than a landmark time T can be considered the low risk class and those with survival time less than T the high risk class. T should be selected to approximately maximize the difference in survival distributions between the two classes, and the prevalence should be determined accordingly. For example, if 60% of patients fail, mostly within 3 years, and the remainder are cured, it would be reasonable to use T=3 years and a prevalence of 60% for purposes of sample size planning. This heuristic approach should work as long as the proportion of patients who will be censored (lost to follow-up) before time T is small.
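A minimal sketch of the landmark heuristic described above, written in Python for illustration only. The function names and the toy data are hypothetical, and the handling of patients censored before T (left unclassified) and of failures occurring exactly at T (counted as high risk) are assumptions not specified in the text.

```python
def landmark_classes(times, events, landmark):
    """Assign risk classes at a landmark time.

    times: follow-up time for each patient.
    events: True if failure was observed at that time, False if censored.
    Returns "low", "high", or None (censored before the landmark) per patient.
    """
    labels = []
    for time, failed in zip(times, events):
        if time > landmark:
            labels.append("low")   # known to have survived past the landmark
        elif failed:
            labels.append("high")  # failed at or before the landmark
        else:
            labels.append(None)    # censored early: class unknown
    return labels


def planning_inputs(labels):
    """Return (prevalence of the high risk class, fraction left unclassified)."""
    known = [label for label in labels if label is not None]
    prevalence = sum(label == "high" for label in known) / len(known)
    unclassified = (len(labels) - len(known)) / len(labels)
    return prevalence, unclassified


# Hypothetical pilot data: follow-up in years and failure indicators
times = [1.0, 2.5, 0.8, 4.0, 5.2, 2.0, 1.5, 6.0, 2.9, 3.5]
events = [True, True, True, False, False, True, False, False, True, True]

labels = landmark_classes(times, events, landmark=3.0)
prevalence, unclassified = planning_inputs(labels)
print(labels)
print(round(prevalence, 2), unclassified)
```

The second value returned by `planning_inputs` is the fraction of patients censored before T. The heuristic is reasonable only when that fraction is small.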