Survival Risk Group Prediction Analysis for Gene Expression Data

BRB-ArrayTools Development Team

2017-10-11

This package implements the survival risk group prediction analysis tool in BRB-ArrayTools (https://brb.nci.nih.gov/BRB-ArrayTools/). The principal components or penalized Cox regression method is employed to select a gene list for fitting the Cox proportional harzards model. It provides an assessment of whether the association of expression data to survival data is statistically significant. It also lets the user evaluate whether the expression data provides more accurate predictions than that provided by standard clinical covariates. Kaplan-Meier curves are given for 2-/3-risk groups obtained by leave-one-our or 10-fold cross validation. The ROC curve at a specified landmark time using the Nearest Neighbor Estimation method will also be provided.

When no clinical covariates are provided, the model with gene expression only will be used. If any covariates are provided, models with covariates only and with both gene expression and covariates will be considered. In the model with gene expression only, for each gene the gene expression is used in the Cox proportional hazards model. Significant genes are selected for testing the hypothesis if the expression is a predictor of survival. They will be used to create ‘super genes’ or the principal components. The Cox proportional hazards model is performed again by using these principal components. In the model with covariates only, the covariates are used in the Cox proportional hazards model. No gene expression profile is used in the risk group prediction. In the model with both gene expression and covariates, significant genes are selected by the likelihood ration test by comparing the model containing covariates only and the model containing both gene expression and covariates. So the significant genes are selected by testing the hypothesis whether the expression data is a predictor of survival given the covariates. The significant genes are used to form the principal components. The Cox proportonal hazards model is used again by using these principal components. After fitting the Cox proportional hazards model, we can obtain the prognostic index (the exponential component in the Cox proportional hazards model) for each training samples. These prognostic indices will then be used in predicting risk groups for left-out or new samples.

Quick Start

This package provides test.survRiskPredict for a quick start of survival risk group prediction anaysis
over one of the built-in sample data (i.e., “Perou”, “Pomeroy”).

library(survriskpred)
res <- test.survRiskPredict("Pomeroy",outputName = "survivalRiskPrediction", generateHTML = TRUE)
names(res)

It outputs an HTML file (C:\Users\YourUserName\Documents\Pomeroy\Output\SurvivalRiskPrediction\survivalRiskPrediction.html) with survival risk group prediction results as well as a list res including the following objects:

## [1] "predictNewSamplesModE"    "genesInClassifierModE"   
## [3] "loadingMatrixModE"        "predictRiskTrainingModE" 
## [5] "predictNewSamplesModCo"   "genesInClassifierModCo"  
## [7] "loadingMatrixModCo"       "predictRiskTrainingModCo"
## [9] "workPath"  

Here we give simple explanation about each object in res:

##                Array id Prognostic index Prediction
## Brain_MD_60 Brain_MD_60           -1.611        low
##     p-value % CV Support       ProbeSet Symbol
## 1  9.52e-05          100 L17131_rna1_at  HMGA1 
## 2 0.0001127           90      D63880_at NCAPD2
## 3 0.0002068           90      M73547_at  REEP5
## 4 0.0002393           70    D28473_s_at   IARS
## 5 0.0009122           20      S78296_at    INA
## 6 0.0009407           40      U15008_at SNRPD2
##                                        Name       EntrezID
## 1                   high mobility group AT-hook 1     3159
## 2         non-SMC condensin I complex, subunit D2     9918
## 3                    receptor accessory protein 5     7905
## 4                       isoleucyl-tRNA synthetase     3376
## 5 internexin neuronal intermediate filament protein   9118
## 6   small nuclear ribonucleoprotein D2 polypeptide    6633
##                Loading 1 Loading 2 Correlation 1 Correlation 2   Weights
## L17131_rna1_at -0.053385 -0.046797     -0.894055     -0.321994  0.757902
## D63880_at      -0.024108  0.002353     -0.590122      0.023660  0.359988
## M73547_at       0.024781 -0.038487      0.690807     -0.440791 -0.397288
## D28473_s_at    -0.010936 -0.007266     -0.282671     -0.077159  0.157008
## S78296_at       0.027138 -0.085981      0.580856     -0.756109 -0.468173
## U15008_at      -0.018775 -0.040807     -0.497391     -0.444177  0.248152
##          Array Survival time Censoring indicator Predicted risk
## 1   Brain_MD_1            11                   1           high
## 2   Brain_MD_2             5                   1            low
## 3   Brain_MD_3             7                   1           high
## 4   Brain_MD_4             7                   1            low
## 5   Brain_MD_5             7                   1            low
## 6   Brain_MD_6             9                   1           high
## 7   Brain_MD_7            14                   1           high
## 8   Brain_MD_8            16                   1            low
## 9   Brain_MD_9            18                   1           high
##                Array id Prognostic index Prediction
## Brain_MD_60 Brain_MD_60           -1.622        low
##     p-value % CV Support       ProbeSet Symbol 
## 1  9.52e-05          100 L17131_rna1_at  HMGA1 
## 2 0.0001127           90      D63880_at NCAPD2 
## 3 0.0002068           90      M73547_at  REEP5 
## 4 0.0002393           70    D28473_s_at   IARS 
## 5 0.0009122           20      S78296_at    INA 
## 6 0.0009407           40      U15008_at SNRPD2 
##                                                      Name EntrezID
##                             high mobility group AT-hook 1     3159
##                   non-SMC condensin I complex, subunit D2     9918
##                              receptor accessory protein 5     7905
##                                 isoleucyl-tRNA synthetase     3376
##  internexin neuronal intermediate filament protein, alpha     9118
##    small nuclear ribonucleoprotein D2 polypeptide 16.5kDa     6633
##                Loading 1 Loading 2 Correlation 1 Correlation 2
## L17131_rna1_at -0.053385 -0.046797     -0.894055     -0.321994
## D63880_at      -0.024108  0.002353     -0.590122      0.023660
## D28473_s_at    -0.010936 -0.007266     -0.282671     -0.077159
## M73547_at       0.024781 -0.038487      0.690807     -0.440791
## U15008_at      -0.018775 -0.040807     -0.497391     -0.444177
## S78296_at       0.027138 -0.085981      0.580856     -0.756109
##          Array Survival time Censoring indicator Predicted risk
## 1   Brain_MD_1            11                   1           high
## 2   Brain_MD_2             5                   1            low
## 3   Brain_MD_3             7                   1           high
## 4   Brain_MD_4             7                   1            low
## 5   Brain_MD_5             7                   1            low
## 6   Brain_MD_6             9                   1           high
## 7   Brain_MD_7            14                   1           high
## 8   Brain_MD_8            16                   1            low
## 9   Brain_MD_9            18                   1           high

Kaplan-Meier curves and time-dependent ROC curves can be plotted through plotKMCurve and plotROCCurveRisk, respectively.

library(survriskpred)
res <- test.survRiskPredict("Pomeroy",outputName = "SurvivalRiskPrediction")
## Starting the analysis using the full training data.
## Cross-validating the model.
## The number of folds in CV is  10 
## 1  2  3  4  5  6  7  8  9  10
## Complete risk group prediction.
## Attempting to create annotations ...
## Preparing output tables ...
## Attempting to create annotations ...
## Preparing output tables ...
## Attempting to create Master Annotations ...
plotKMCurve(res, 1)
plotKMCurve(res, 2)
plotKMCurve(res, 3)
plotROCCurveRisk(res)

Data Input

In this section, we provide an example to apply survRiskPredict over the same Pomeroy dataset we used in the previous section. There are a total of 60 patient samples with 1914 genes, among which 59 patients are used as training data and 1 patient as test data. The sex information is given as a clinical covariate. The following procedure used principal components method to select a gene list for 2-risk group prediction at the threshold significance level of 0.001.

data("Pomeroy")
projectPath <- tempdir()
outputName <- "SurvivalRiskPrediction"
generateHTML <- TRUE
resList2 <- survRiskPredict(exprTrain, covTrain, exprTest,covTest,
                           geneId, status, tme, geneSelect ="pc",
                           nriskgroups = 2, progIndexPerc = 50,
                           cvMethod = "10fold", nperm = 0, landmarktime = 0,
                           alpha = .001, ncomp = 2, mixing = 1, pcrgenes = 10,
                           projectPath = projectPath,
                           outputName = outputName, generateHTML)
if (generateHTML)
  browseURL(file.path(projectPath, "Output", outputName,
            paste0(outputName, ".html")))

Note that data(Pomeroy) loads a number of R objects as follows:

The resList2 list is the same as resList shown in the Quick Start Section. For more details about survRiskPredict, please type help("survRiskPredict") in the R console.