1 Introduction

This package implements the survival risk group prediction analysis tool in BRB-ArrayTools (https://brb.nci.nih.gov/BRB-ArrayTools/). The principal components or penalized Cox regression method is employed to select a gene list for fitting the Cox proportional harzards model. It provides an assessment of whether the association of expression data to survival data is statistically significant. It also lets the user evaluate whether the expression data provides more accurate predictions than that provided by standard clinical covariates. Kaplan-Meier curves are given for 2-/3-risk groups obtained by leave-one-our or 10-fold cross validation. The ROC curve at a specified landmark time using the Nearest Neighbor Estimation method will also be provided.

When no clinical covariates are provided, the model with gene expression only will be used. If any covariates are provided, models with covariates only and with both gene expression and covariates will be considered. In the model with gene expression only, for each gene the gene expression is used in the Cox proportional hazards model. Significant genes are selected for testing the hypothesis if the expression is a predictor of survival. They will be used to create ‘super genes’ or the principal components. The Cox proportional hazards model is performed again by using these principal components. In the model with covariates only, the covariates are used in the Cox proportional hazards model. No gene expression profile is used in the risk group prediction. In the model with both gene expression and covariates, significant genes are selected by the likelihood ration test by comparing the model containing covariates only and the model containing both gene expression and covariates. So the significant genes are selected by testing the hypothesis whether the expression data is a predictor of survival given the covariates. The significant genes are used to form the principal components. The Cox proportonal hazards model is used again by using these principal components. After fitting the Cox proportional hazards model, we can obtain the prognostic index (the exponential component in the Cox proportional hazards model) for each training samples. These prognostic indices will then be used in predicting risk groups for left-out or new samples.

2 Installation

To install the package from its binary version, you need to manually pre-install the glment, survivalROC and impute dependency packages by running the following script in R console:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("impute")
install.packages(c("glmnet","survivalROC"), repos = "https://cran.rstudio.com")

Afterwards, please install the survriskpred R package through the local installation. Click on “Packages” on the R menu bar, and select “install package(s) from local files”. Please browse for “survriskpred_0.2.zip” and click on “open”.

3 Quick Start

This package provides test.survRiskPredict for a quick start of survival risk group prediction anaysis
over one of the built-in sample data (i.e., “Perou”, “Pomeroy”).

library(survriskpred)
res <- test.survRiskPredict("Pomeroy",outputName = "survivalRiskPrediction", generateHTML = FALSE)
## Starting the analysis using the full training data.
## Cross-validating the model.
## The number of folds in CV is  10 
## 1  2  3  4  5  6  7  8  9  10
## Complete risk group prediction.
## Attempting to create annotations ...
## Preparing output tables ...
## Attempting to create annotations ...
## Preparing output tables ...
## Attempting to create Master Annotations ...

The list res includes the following objects:

names(res) 
##  [1] "predictNewSamplesModE"    "ppval"                   
##  [3] "genesInClassifierModE"    "loadingMatrixModE"       
##  [5] "predictRiskTrainingModE"  "predictNewSamplesModCo"  
##  [7] "genesInClassifierModCo"   "loadingMatrixModCo"      
##  [9] "predictRiskTrainingModCo" "workPath"

Here we give simple explanation about each object in res:

  • res$predictNewSamplesModE is a data frame with risk group predictions of new samples when the model with gene expression only is considered.
res$predictNewSamplesModE
##                Array id Prognostic index Prediction
## Brain_MD_60 Brain_MD_60           -1.611        low
  • res$genesInClassifierModE is a data frame with genes selected by fitting penalized Cox proportional hazards model when the model with gene expression only is considered.
res$genesInClassifierModE[1:6,]
##     p-value % CV Support       ProbeSet Symbol
## 1  9.52e-05          100 L17131_rna1_at  HMGA1
## 2 0.0001127          100      D63880_at NCAPD2
## 3 0.0002068           90      M73547_at  REEP5
## 4 0.0002393           40    D28473_s_at   IARS
## 5 0.0009122           20      S78296_at    INA
## 6 0.0009407           30      U15008_at SNRPD2
##                                                       Name EntrezID
## 1                            high mobility group AT-hook 1     3159
## 2                  non-SMC condensin I complex, subunit D2     9918
## 3                             receptor accessory protein 5     7905
## 4                                isoleucyl-tRNA synthetase     3376
## 5 internexin neuronal intermediate filament protein, alpha     9118
## 6   small nuclear ribonucleoprotein D2 polypeptide 16.5kDa     6633
  • res$loadingMatrixModE is a data frame with loading matrix of the significant genes and the correlations between the principal components and the signficant genes when the model with gene expression only is considered.
res$loadingMatrixModE[1:6,]
##                Loading 1 Loading 2 Correlation 1 Correlation 2   Weights
## L17131_rna1_at -0.053385 -0.046797     -0.894055     -0.321994  0.757902
## D63880_at      -0.024108  0.002353     -0.590122      0.023660  0.359988
## M73547_at       0.024781 -0.038487      0.690807     -0.440791 -0.397288
## D28473_s_at    -0.010936 -0.007266     -0.282671     -0.077159  0.157008
## S78296_at       0.027138 -0.085981      0.580856     -0.756109 -0.468173
## U15008_at      -0.018775 -0.040807     -0.497391     -0.444177  0.248152
  • res$predictRiskTrainingModE is a data frame with a table of arrays, survival time, censoring indicator and survival risk prediction of training samples when the model with gene expression only is considered.
res$predictRiskTrainingModE[1:9,]
##        Array Survival time Censoring indicator Predicted risk
## 1 Brain_MD_1            11                   1           high
## 2 Brain_MD_2             5                   1            low
## 3 Brain_MD_3             7                   1           high
## 4 Brain_MD_4             7                   1            low
## 5 Brain_MD_5             7                   1            low
## 6 Brain_MD_6             9                   1           high
## 7 Brain_MD_7            14                   1            low
## 8 Brain_MD_8            16                   1            low
## 9 Brain_MD_9            18                   1           high
  • res$predictNewSamplesModCo is a data frame with risk group predictions of new samples when the model with both covariates and gene expression is considered.
res$predictNewSamplesModCo
##                Array id Prognostic index Prediction
## Brain_MD_60 Brain_MD_60           -1.622        low
  • res$genesInClassifierModCo is a data frame with genes selected by fitting penalized Cox proportional hazards model when the model with both covariates and gene expression is considered.
res$genesInClassifierModCo[1:6,]
##     p-value % CV Support       ProbeSet Symbol
## 1  9.43e-05          100 L17131_rna1_at  HMGA1
## 2 0.0001134          100      D63880_at NCAPD2
## 3 0.0001407           70    D28473_s_at   IARS
## 4 0.0001842           90      M73547_at  REEP5
## 5 0.0008418           30      U15008_at SNRPD2
## 6 0.0008425           20      S78296_at    INA
##                                                       Name EntrezID
## 1                            high mobility group AT-hook 1     3159
## 2                  non-SMC condensin I complex, subunit D2     9918
## 3                                isoleucyl-tRNA synthetase     3376
## 4                             receptor accessory protein 5     7905
## 5   small nuclear ribonucleoprotein D2 polypeptide 16.5kDa     6633
## 6 internexin neuronal intermediate filament protein, alpha     9118
  • res$loadingMatrixModCo is a data frame with loading matrix of the significant genes and the correlations between the principal components and the signficant genes when the model with both covariates and gene expression is considered.
res$loadingMatrixModCo[1:6,]
##                Loading 1 Loading 2 Correlation 1 Correlation 2
## L17131_rna1_at -0.053385 -0.046797     -0.894055     -0.321994
## D63880_at      -0.024108  0.002353     -0.590122      0.023660
## D28473_s_at    -0.010936 -0.007266     -0.282671     -0.077159
## M73547_at       0.024781 -0.038487      0.690807     -0.440791
## U15008_at      -0.018775 -0.040807     -0.497391     -0.444177
## S78296_at       0.027138 -0.085981      0.580856     -0.756109
  • res$predictRiskTrainingModCo is a data frame with the table of arrays, survival time, censoring indicator and survival risk prediction of training samples when the model with both covariates and gene expression is considered.
res$predictRiskTrainingModCo[1:9,]
##        Array Survival time Censoring indicator Predicted risk
## 1 Brain_MD_1            11                   1           high
## 2 Brain_MD_2             5                   1            low
## 3 Brain_MD_3             7                   1           high
## 4 Brain_MD_4             7                   1            low
## 5 Brain_MD_5             7                   1            low
## 6 Brain_MD_6             9                   1           high
## 7 Brain_MD_7            14                   1           high
## 8 Brain_MD_8            16                   1            low
## 9 Brain_MD_9            18                   1           high
  • res$workPath specifies the path for fortran and other intermediate outputs.

Kaplan-Meier curves and time-dependent ROC curves can be plotted through plotKMCurve and plotROCCurveRisk, respectively.

library(survriskpred)
res <- test.survRiskPredict("Pomeroy",outputName = "SurvivalRiskPrediction")
## Starting the analysis using the full training data.
## Cross-validating the model.
## The number of folds in CV is  10 
## 1  2  3  4  5  6  7  8  9  10
## Complete risk group prediction.
## Attempting to create annotations ...
## Preparing output tables ...
## Attempting to create annotations ...
## Preparing output tables ...
## Attempting to create Master Annotations ...
plotKMCurve(res, 1)
plotKMCurve(res, 2)
plotKMCurve(res, 3)
plotROCCurveRisk(res)

When the argument generateHTML is set as TRUE, an HTML file called survivalRiskPrediction.html will be created under C:\Users\YourUserName\Documents\Pomeroy\Output\SurvivalRiskPrediction.

4 Data Input

In this section, we provide an example to apply survRiskPredict over the same Pomeroy dataset we used in the previous section. There are a total of 60 patient samples with 1914 genes, among which 59 patients are used as training data and 1 patient as test data. The sex information is given as a clinical covariate. The following procedure used principal components method to select a gene list for 2-risk group prediction at the threshold significance level of 0.001.

data("Pomeroy")
projectPath <- file.path(Sys.getenv("HOME"),"Pomeroy")
outputName <- "SurvivalRiskPrediction"
generateHTML <- TRUE
resList2 <- survRiskPredict(exprTrain, covTrain, exprTest,covTest,
                           geneId, status, tme, geneSelect ="pc",
                           nriskgroups = 2, progIndexPerc = 50,
                           cvMethod = "10fold", nperm = 0, landmarktime = 0,
                           alpha = .001, ncomp = 2, mixing = 1, pcrgenes = 10,
                           projectPath = projectPath,
                           outputName = outputName, generateHTML)
if (generateHTML)
  browseURL(file.path(projectPath, "Output", outputName,
            paste0(outputName, ".html")))

Note that data(Pomeroy) loads a number of R objects as follows:

  • exprTrain: a data frame of gene expression data for 59 training samples with 1914 genes. Rows are genes and columns are samples.

  • covTrain: a data frame of a clinical covariate (i.e., sex) from 59 training samples. Rows are samples and column is the sex covariate.

  • exprTest: a data frame of gene expression data for 1 new sample with 1914 genes.

  • covTest: a data frame of a clinical covariate (i.e., sex) from 1 new sample. Rows are samples and column is the sex covariate.

  • geneId: a data frame with 1914 gene IDs in rows and annotations (e.g., gene symbol) in columns.

  • tme: a vector speciying survival time. It is ordered in the same order of training samples followed by new samples.

  • status: a vector specifying survival status (1 = death, 0 = censored). It is ordered in the same order of training samples followed by new samples.

The resList2 list is the same as resList shown in the Quick Start Section. For more details about survRiskPredict, please type help("survRiskPredict") in the R console.