Development of Statistical Methods for Microarray Analysis

One of the main focuses of the Molecular Statistics and Bioinformatics Section is developing methodology to assist in the analysis of gene expression data derived from microarray systems. We are interested in developing statistical methods for both the "early phase" (design of arrays, image analysis, data quality assessment, ratio calculation and normalization, etc.) and "discovery phase" (cluster analysis, class prediction, tests of significance, etc.) of microarray research projects. Several methods we have developed and are developing are discussed below.

Assessment of Cluster Reproducibility

Hierarchical cluster analysis is a popular method for examining the relationships between genes or experiments based on gene expression data from microarray experiments. Cluster analysis can be helpful for determining which genes have the most similar expression across experiments and which experiments have the most similar gene expression profiles. However, cluster algorithms always result in the formation of clusters, even for data sets where little or no underlying structure is present. Furthermore, statistical significance cannot be assigned to particular clusters using standard statistical techniques.

Therefore, we have developed measures that are helpful in assessing the reproducibility of individual clusters. The fundamental idea behind these measures is that the most believable clusters are those that would persist given small perturbations of the data, where the perturbations represent an anticipated level of noise in gene expression measurements due to assay variability and variation due to sub-sampling of specimens. This method was used to assess the reproducibility of a novel clustering of melanoma samples and cell lines in a collaboration with a group of investigators from the National Human Genome Research Institute [1].
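The perturbation idea can be illustrated with a minimal sketch. It is not the published measure: the distance (1 minus Pearson correlation), the average-linkage rule, the Gaussian noise model, the pairwise "stay together" score, and all function names and simulated data below are assumptions made for illustration only.

```python
import numpy as np

def cluster_labels(profiles, n_clusters):
    """Average-linkage hierarchical clustering of specimens (rows), cut at n_clusters.

    Distance is 1 - Pearson correlation between specimen profiles (assumed choice).
    """
    n = profiles.shape[0]
    dist = 1.0 - np.corrcoef(profiles)
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise distance between the two candidate clusters
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

def cluster_reproducibility(profiles, n_clusters, noise_sd, n_perturb=50, seed=0):
    """For each original cluster, the average proportion of its specimen pairs
    that are still clustered together after adding Gaussian noise."""
    rng = np.random.default_rng(seed)
    original = cluster_labels(profiles, n_clusters)
    scores = np.zeros(n_clusters)
    for _ in range(n_perturb):
        noisy = profiles + rng.normal(0.0, noise_sd, size=profiles.shape)
        perturbed = cluster_labels(noisy, n_clusters)
        for k in range(n_clusters):
            members = np.flatnonzero(original == k)
            if len(members) < 2:
                scores[k] = np.nan  # a singleton has no pairs to track
                continue
            same = perturbed[members][:, None] == perturbed[members][None, :]
            upper = np.triu_indices(len(members), k=1)
            scores[k] += same[upper].mean()
    return original, scores / n_perturb

# Simulated example: two groups of three specimens with opposite 40-gene patterns.
rng = np.random.default_rng(1)
pattern = rng.normal(0.0, 3.0, size=40)
group_a = pattern + rng.normal(0.0, 0.3, (3, 40))
group_b = -pattern + rng.normal(0.0, 0.3, (3, 40))
profiles = np.vstack([group_a, group_b])
labels, scores = cluster_reproducibility(profiles, n_clusters=2, noise_sd=0.3)
```

A score near 1 for a cluster suggests that its members tend to stay together under the assumed noise level; in practice `noise_sd` would be chosen to reflect the anticipated assay and sub-sampling variability.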


Class Prediction of Tumor Subtypes

One of the potential uses of microarray data is for the classification of specimens into phenotypic, prognostic or predictive groups based solely on their gene expression profiles. For example, a substantial proportion of node-negative breast cancers will be cured by surgery alone (with no further treatment). Determining which patients have such cancers would be of great clinical benefit. Though breast cancers that are curable by surgery alone may not be phenotypically distinguishable from others, it is possible they have distinguishable gene expression patterns that can be used as the basis of classification.

We have developed a method for the classification of specimens into one of two pre-determined classes based on gene expression data using a compound covariate predictor [2]. The predictor is a linear combination of the log-expression ratios of genes differentially expressed between the two classes, with the log-ratio of each gene weighted by the univariate two-sample t-statistic for the gene. A classification threshold is selected that assigns a specimen into one of the two classes based on the value of its compound covariate predictor. We use a cross-validated approach for the classification of specimens and have developed a permutation test for determining the significance of resulting misclassification error rates. We are applying this method to microarray data for various types of cancer in collaboration with researchers within the National Institutes of Health.
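A rough sketch of such a predictor follows. Several details are assumptions rather than the method as implemented by the Section: genes are selected as the top `n_genes` by absolute t-statistic (the text does not state the selection rule), the threshold is the midpoint of the two class means of the predictor, cross-validation is leave-one-out, and the data are simulated. All function names are hypothetical.

```python
import numpy as np

def t_statistics(x, y):
    """Pooled-variance two-sample t-statistic per gene.
    x: (n_specimens, n_genes) log-ratios; y: array of 0/1 class labels."""
    a, b = x[y == 0], x[y == 1]
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(axis=0, ddof=1)
              + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (b.mean(axis=0) - a.mean(axis=0)) / np.sqrt(pooled * (1 / na + 1 / nb))

def fit_ccp(x, y, n_genes=10):
    """Select genes, weight them by their t-statistics, and set a threshold."""
    t = t_statistics(x, y)
    selected = np.argsort(-np.abs(t))[:n_genes]  # assumed selection rule
    weights = t[selected]
    scores = x[:, selected] @ weights
    # assumed threshold: midpoint between the class means of the predictor
    threshold = 0.5 * (scores[y == 0].mean() + scores[y == 1].mean())
    return selected, weights, threshold

def predict_ccp(model, x_new):
    selected, weights, threshold = model
    return (x_new[:, selected] @ weights > threshold).astype(int)

def loo_error(x, y, n_genes=10):
    """Leave-one-out error; gene selection is repeated inside every fold."""
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = fit_ccp(x[keep], y[keep], n_genes)
        errors += int(predict_ccp(model, x[i:i + 1])[0] != y[i])
    return errors / len(y)

def permutation_p_value(x, y, n_genes=10, n_perm=200, seed=0):
    """Proportion of label permutations whose cross-validated error rate
    is at least as small as the observed one."""
    rng = np.random.default_rng(seed)
    observed = loo_error(x, y, n_genes)
    count = sum(loo_error(x, rng.permutation(y), n_genes) <= observed
                for _ in range(n_perm))
    return observed, (count + 1) / (n_perm + 1)

# Simulated example: 20 specimens, 100 genes, 10 of them shifted in class 1.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 10)
x = rng.normal(size=(20, 100))
x[y == 1, :10] += 2.0
error_rate, p_value = permutation_p_value(x, y, n_perm=50)
```

Repeating the gene selection within each cross-validation fold matters: selecting genes once on the full data and then cross-validating only the threshold would give an optimistically biased error rate.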


Comparison of Microarray Designs for Class Comparison and Class Discovery

Microarray design can have a significant impact on the researchers' ability to identify genes associated with cancer phenotypes, and to discover new taxonomies for tumors from expression profiles. Complementary DNA microarrays are based on competitive hybridization of pairs of RNA samples to the array. Frequently, a sub-sample of a common reference RNA sample is used as one of the two samples hybridized on each microarray. Recently, other experimental designs for allocating samples to arrays have been proposed. However, the relative merits of microarray designs have not been thoroughly evaluated.

We have developed a statistical model that facilitates the evaluation of designs when the goal is to compare pre-specified groups, and when the goal is to seek out new taxonomies. In all cases, design description must include the level at which samples are to be drawn, e.g. which are to represent multiple aliquots from a single RNA source, and which aliquots from different sources. When comparing pre-specified groups, the relative efficiencies of different designs are shown to depend on the relation between intra- and inter-sample variability. When seeking out new taxonomies, both analytic results and Monte Carlo methods show that for certain designs the ability to identify meaningful clusters breaks down as the sample size increases. These results suggest some relatively straightforward guidelines for selecting a microarray design depending on the objectives of the experiment.


Prognostic Prediction Using Gene Expression Profiles

We are extending our research on tumor class prediction for binary outcome data to cases where outcome is continuous. Specifically, we are developing methodology for associating patterns of gene expression with survival time in patients diagnosed with cancer. We adjust for standard prognostic factors in developing a gene expression prognostic index so that, if significance is obtained for the resulting index, the index provides prognostic information beyond current standards.


Controlling the number of false discoveries: Application to high dimensional genomic data

A straightforward approach to the identification of genes expressed differentially between different groups of individuals is to perform a univariate analysis of group mean differences for each gene, and then identify those genes that are most statistically significant. Using nominal significance levels (unadjusted for the multiple comparisons) will lead to the identification of many genes that truly are not differentially expressed, "false discoveries". A reasonable strategy in many situations is to allow a small number of false discoveries, or a small proportion of the identified genes to be false discoveries. Although previous work has considered control for the expected proportion of false discoveries, we show these methods may be inadequate. We propose two stepwise permutation-based procedures to control with specified confidence the actual number of false discoveries and approximately the actual proportion of false discoveries. Limited simulation studies demonstrate substantial gain in sensitivity to detect truly differentially expressed genes even when allowing as few as one or two false discoveries. We apply these new methods to analyze a microarray data set consisting of measurements on approximately 9000 genes in paired tumor specimens, collected both before and after chemotherapy on 20 breast cancer patients. The methods described are broadly applicable to the problem of identifying which variables of any large set of measured variables differ between pre-specified groups.
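As a simplified illustration of permutation-based control of the number of false discoveries, the sketch below uses a single-step rule for paired data rather than the stepwise procedures proposed above. It assumes sign-flipping of within-patient differences as the permutation scheme, paired t-statistics as the test statistic, and simulated data; the function names are hypothetical. To allow at most `k` false discoveries with confidence `1 - alpha`, it compares each observed statistic with the `1 - alpha` quantile of the permutation distribution of the `(k + 1)`-th largest absolute statistic.

```python
import numpy as np

def paired_t(diffs):
    """One-sample t-statistic per gene on paired differences (rows = patients)."""
    n = diffs.shape[0]
    return diffs.mean(axis=0) / (diffs.std(axis=0, ddof=1) / np.sqrt(n))

def single_step_k_fd(diffs, k=1, alpha=0.05, n_perm=1000, seed=0):
    """Single-step rule aimed at P(more than k false discoveries) <= alpha."""
    rng = np.random.default_rng(seed)
    observed = np.abs(paired_t(diffs))
    null_stats = np.empty(n_perm)
    for p in range(n_perm):
        # flip the sign of each patient's whole vector of differences,
        # which preserves the correlation among genes
        signs = rng.choice([-1.0, 1.0], size=(diffs.shape[0], 1))
        t_perm = np.abs(paired_t(diffs * signs))
        # (k + 1)-th largest absolute statistic in this permutation
        null_stats[p] = np.partition(t_perm, -(k + 1))[-(k + 1)]
    cutoff = np.quantile(null_stats, 1.0 - alpha)
    return np.flatnonzero(observed > cutoff), cutoff

# Simulated example: 20 patients, 500 genes, the first 20 truly changed.
rng = np.random.default_rng(0)
diffs = rng.normal(size=(20, 500))
diffs[:, :20] += 2.5
found, cutoff = single_step_k_fd(diffs, k=1, alpha=0.05, n_perm=500)
```

Setting `k = 0` reduces this to a conventional maximum-statistic adjustment; allowing `k = 1` or `k = 2` lowers the cutoff, which is the source of the gain in sensitivity described above. A stepwise version would recompute the cutoff after removing genes already declared significant.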


Identifying pre-post chemotherapy differences in gene expression in breast tumors: a statistical method appropriate for this aim

Although widely used for the analysis of gene expression microarray data, cluster analysis may not be the most appropriate statistical technique for some study aims. We demonstrate this by considering a previous analysis of microarray data obtained on breast tumor specimens, many of which were paired specimens from the same patient before and after chemotherapy. Reanalyzing the data using statistical methods that appropriately utilize the paired differences for identification of differentially expressed genes, we find 17 genes that we can confidently identify as more expressed after chemotherapy than before. These findings were not reported by the original investigators who analyzed the data using cluster analysis techniques.


Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data

Recent technological advances such as cDNA microarray technology have made it possible to simultaneously interrogate thousands of genes in a biological specimen. A cDNA microarray experiment produces a gene expression "profile". Often interest lies in discovering novel subgroupings, or "clusters", of specimens based on their profiles, for example identification of new tumor taxonomies. Cluster analysis techniques such as hierarchical clustering and self-organizing maps have frequently been used for investigating structure in microarray data. However, clustering algorithms always detect clusters, even on random data, and it is easy to misinterpret the results without some objective measure of the reproducibility of the clusters. We present statistical methods for testing for overall clustering of gene expression profiles, and we define easily interpretable measures of cluster-specific reproducibility that facilitate understanding of the clustering structure. We apply these methods to elucidate structure in cDNA microarray gene expression profiles obtained on melanoma tumors and on prostate specimens.


Multiple comparisons methods applied to multivariate Cox regression models

When clinical outcome data are available on a set of specimens that have been molecularly profiled by cDNA arrays, it is of interest to identify genes whose expression levels are associated with survival. One approach to this problem is to perform a univariate survival analysis relating survival to expression level for each gene, and then identify those genes that are most statistically significant. Using nominal significance levels (unadjusted for the multiple comparisons) will lead to the identification of many genes that truly are not associated with survival, "false discoveries". If there is no adjustment for other covariates in the survival model, step-down permutation methods that control the number or proportion of false discoveries can be readily adapted to this setting. However, identification of genes that remain significantly associated with survival after adjustment for standard prognostic variables is of particular interest. In this situation, permutation techniques cannot be directly applied due to likely correlations among the genes and standard prognostic variables. We are exploring modified permutation and bootstrapping methods to address this problem.


Statistical treatment of saturated spots in cDNA microarray data

Saturation of fluorescent signal may be encountered for spots on a cDNA microarray corresponding to highly expressed genes. Naïve thresholding of pixel levels at the saturation point will lead to underestimation of total intensity for the spot. We explore some statistical methods to adjust for saturation and provide less biased estimates of total spot intensity.
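The direction of the bias can be seen in a small simulation. This is only a demonstration of the underestimation, not an adjustment method; the saturation ceiling of 65535 (a 16-bit scanner), the lognormal pixel distribution, and the spot size are all assumptions made for illustration.

```python
import numpy as np

SATURATION = 65535  # assumed ceiling of a 16-bit scanner; not stated in the text

rng = np.random.default_rng(0)
# hypothetical bright spot: 200 pixels with a median true intensity of 50000
true_pixels = rng.lognormal(mean=np.log(50000), sigma=0.4, size=200)
# a saturating detector reports the ceiling for every pixel above it
recorded = np.minimum(true_pixels, SATURATION)

true_total = true_pixels.sum()
naive_total = recorded.sum()
saturated_fraction = np.mean(true_pixels > SATURATION)
relative_bias = (naive_total - true_total) / true_total  # negative: underestimate
```

Because every saturated pixel is recorded at or below its true value, the naive total can never exceed the true total, and the shortfall grows with the fraction of saturated pixels.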


1. Bittner, M. et al., Molecular classification of cutaneous malignant melanoma by gene expression profiling, Nature, 406:536-540, 2000.

2. Tukey, J.W., Tightening the clinical trial, Controlled Clinical Trials, 14:266-285, 1993.

Updated on Nov. 2, 2015