This package implements the survival risk group prediction analysis tool in BRB-ArrayTools (https://brb.nci.nih.gov/BRB-ArrayTools/
). The principal components or penalized Cox regression method is employed to select a gene list for fitting the Cox proportional harzards model. It provides an assessment of whether the association of expression data to survival data is statistically significant. It also lets the user evaluate whether the expression data provides more accurate predictions than that provided by standard clinical covariates. Kaplan-Meier curves are given for 2-/3-risk groups obtained by leave-one-our or 10-fold cross validation. The ROC curve at a specified landmark time using the Nearest Neighbor Estimation method will also be provided.
When no clinical covariates are provided, the model with gene expression only will be used. If any covariates are provided, models with covariates only and with both gene expression and covariates will be considered. In the model with gene expression only, for each gene the gene expression is used in the Cox proportional hazards model. Significant genes are selected for testing the hypothesis if the expression is a predictor of survival. They will be used to create ‘super genes’ or the principal components. The Cox proportional hazards model is performed again by using these principal components. In the model with covariates only, the covariates are used in the Cox proportional hazards model. No gene expression profile is used in the risk group prediction. In the model with both gene expression and covariates, significant genes are selected by the likelihood ration test by comparing the model containing covariates only and the model containing both gene expression and covariates. So the significant genes are selected by testing the hypothesis whether the expression data is a predictor of survival given the covariates. The significant genes are used to form the principal components. The Cox proportonal hazards model is used again by using these principal components. After fitting the Cox proportional hazards model, we can obtain the prognostic index (the exponential component in the Cox proportional hazards model) for each training samples. These prognostic indices will then be used in predicting risk groups for left-out or new samples.
To install the package from its binary version, you need to manually pre-install the glment, survivalROC and impute dependency packages by running the following script in R console:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("impute")
install.packages(c("glmnet","survivalROC"), repos = "https://cran.rstudio.com")
Afterwards, please install the survriskpred R package through the local installation. Click on “Packages” on the R menu bar, and select “install package(s) from local files”. Please browse for “survriskpred_0.2.zip” and click on “open”.
This package provides test.survRiskPredict
for a quick start of survival risk group prediction anaysis
over one of the built-in sample data (i.e., “Perou”, “Pomeroy”).
library(survriskpred)
res <- test.survRiskPredict("Pomeroy",outputName = "survivalRiskPrediction", generateHTML = FALSE)
## Starting the analysis using the full training data.
## Cross-validating the model.
## The number of folds in CV is 10
## 1 2 3 4 5 6 7 8 9 10
## Complete risk group prediction.
## Attempting to create annotations ...
## Preparing output tables ...
## Attempting to create annotations ...
## Preparing output tables ...
## Attempting to create Master Annotations ...
The list res
includes the following objects:
names(res)
## [1] "predictNewSamplesModE" "ppval"
## [3] "genesInClassifierModE" "loadingMatrixModE"
## [5] "predictRiskTrainingModE" "predictNewSamplesModCo"
## [7] "genesInClassifierModCo" "loadingMatrixModCo"
## [9] "predictRiskTrainingModCo" "workPath"
Here we give simple explanation about each object in res
:
res$predictNewSamplesModE
is a data frame with risk group predictions of new samples when the model with gene expression only is considered.res$predictNewSamplesModE
## Array id Prognostic index Prediction
## Brain_MD_60 Brain_MD_60 -1.611 low
res$genesInClassifierModE
is a data frame with genes selected by fitting penalized Cox proportional hazards model when the model with gene expression only is considered.res$genesInClassifierModE[1:6,]
## p-value % CV Support ProbeSet Symbol
## 1 9.52e-05 100 L17131_rna1_at HMGA1
## 2 0.0001127 100 D63880_at NCAPD2
## 3 0.0002068 90 M73547_at REEP5
## 4 0.0002393 40 D28473_s_at IARS
## 5 0.0009122 20 S78296_at INA
## 6 0.0009407 30 U15008_at SNRPD2
## Name EntrezID
## 1 high mobility group AT-hook 1 3159
## 2 non-SMC condensin I complex, subunit D2 9918
## 3 receptor accessory protein 5 7905
## 4 isoleucyl-tRNA synthetase 3376
## 5 internexin neuronal intermediate filament protein, alpha 9118
## 6 small nuclear ribonucleoprotein D2 polypeptide 16.5kDa 6633
res$loadingMatrixModE
is a data frame with loading matrix of the significant genes and the correlations between the principal components and the signficant genes when the model with gene expression only is considered.res$loadingMatrixModE[1:6,]
## Loading 1 Loading 2 Correlation 1 Correlation 2 Weights
## L17131_rna1_at -0.053385 -0.046797 -0.894055 -0.321994 0.757902
## D63880_at -0.024108 0.002353 -0.590122 0.023660 0.359988
## M73547_at 0.024781 -0.038487 0.690807 -0.440791 -0.397288
## D28473_s_at -0.010936 -0.007266 -0.282671 -0.077159 0.157008
## S78296_at 0.027138 -0.085981 0.580856 -0.756109 -0.468173
## U15008_at -0.018775 -0.040807 -0.497391 -0.444177 0.248152
res$predictRiskTrainingModE
is a data frame with a table of arrays, survival time, censoring indicator and survival risk prediction of training samples when the model with gene expression only is considered.res$predictRiskTrainingModE[1:9,]
## Array Survival time Censoring indicator Predicted risk
## 1 Brain_MD_1 11 1 high
## 2 Brain_MD_2 5 1 low
## 3 Brain_MD_3 7 1 high
## 4 Brain_MD_4 7 1 low
## 5 Brain_MD_5 7 1 low
## 6 Brain_MD_6 9 1 high
## 7 Brain_MD_7 14 1 low
## 8 Brain_MD_8 16 1 low
## 9 Brain_MD_9 18 1 high
res$predictNewSamplesModCo
is a data frame with risk group predictions of new samples when the model with both covariates and gene expression is considered.res$predictNewSamplesModCo
## Array id Prognostic index Prediction
## Brain_MD_60 Brain_MD_60 -1.622 low
res$genesInClassifierModCo
is a data frame with genes selected by fitting penalized Cox proportional hazards model when the model with both covariates and gene expression is considered.res$genesInClassifierModCo[1:6,]
## p-value % CV Support ProbeSet Symbol
## 1 9.43e-05 100 L17131_rna1_at HMGA1
## 2 0.0001134 100 D63880_at NCAPD2
## 3 0.0001407 70 D28473_s_at IARS
## 4 0.0001842 90 M73547_at REEP5
## 5 0.0008418 30 U15008_at SNRPD2
## 6 0.0008425 20 S78296_at INA
## Name EntrezID
## 1 high mobility group AT-hook 1 3159
## 2 non-SMC condensin I complex, subunit D2 9918
## 3 isoleucyl-tRNA synthetase 3376
## 4 receptor accessory protein 5 7905
## 5 small nuclear ribonucleoprotein D2 polypeptide 16.5kDa 6633
## 6 internexin neuronal intermediate filament protein, alpha 9118
res$loadingMatrixModCo
is a data frame with loading matrix of the significant genes and the correlations between the principal components and the signficant genes when the model with both covariates and gene expression is considered.res$loadingMatrixModCo[1:6,]
## Loading 1 Loading 2 Correlation 1 Correlation 2
## L17131_rna1_at -0.053385 -0.046797 -0.894055 -0.321994
## D63880_at -0.024108 0.002353 -0.590122 0.023660
## D28473_s_at -0.010936 -0.007266 -0.282671 -0.077159
## M73547_at 0.024781 -0.038487 0.690807 -0.440791
## U15008_at -0.018775 -0.040807 -0.497391 -0.444177
## S78296_at 0.027138 -0.085981 0.580856 -0.756109
res$predictRiskTrainingModCo
is a data frame with the table of arrays, survival time, censoring indicator and survival risk prediction of training samples when the model with both covariates and gene expression is considered.res$predictRiskTrainingModCo[1:9,]
## Array Survival time Censoring indicator Predicted risk
## 1 Brain_MD_1 11 1 high
## 2 Brain_MD_2 5 1 low
## 3 Brain_MD_3 7 1 high
## 4 Brain_MD_4 7 1 low
## 5 Brain_MD_5 7 1 low
## 6 Brain_MD_6 9 1 high
## 7 Brain_MD_7 14 1 high
## 8 Brain_MD_8 16 1 low
## 9 Brain_MD_9 18 1 high
res$workPath
specifies the path for fortran and other intermediate outputs.Kaplan-Meier curves and time-dependent ROC curves can be plotted through plotKMCurve
and plotROCCurveRisk
, respectively.
library(survriskpred)
res <- test.survRiskPredict("Pomeroy",outputName = "SurvivalRiskPrediction")
## Starting the analysis using the full training data.
## Cross-validating the model.
## The number of folds in CV is 10
## 1 2 3 4 5 6 7 8 9 10
## Complete risk group prediction.
## Attempting to create annotations ...
## Preparing output tables ...
## Attempting to create annotations ...
## Preparing output tables ...
## Attempting to create Master Annotations ...
plotKMCurve(res, 1)
plotKMCurve(res, 2)
plotKMCurve(res, 3)
plotROCCurveRisk(res)
When the argument generateHTML
is set as TRUE
, an HTML file called survivalRiskPrediction.html will be created under C:\Users\YourUserName\Documents\Pomeroy\Output\SurvivalRiskPrediction
.
In this section, we provide an example to apply survRiskPredict
over the same Pomeroy dataset we used in the previous section. There are a total of 60 patient samples with 1914 genes, among which 59 patients are used as training data and 1 patient as test data. The sex information is given as a clinical covariate. The following procedure used principal components method to select a gene list for 2-risk group prediction at the threshold significance level of 0.001.
data("Pomeroy")
projectPath <- file.path(Sys.getenv("HOME"),"Pomeroy")
outputName <- "SurvivalRiskPrediction"
generateHTML <- TRUE
resList2 <- survRiskPredict(exprTrain, covTrain, exprTest,covTest,
geneId, status, tme, geneSelect ="pc",
nriskgroups = 2, progIndexPerc = 50,
cvMethod = "10fold", nperm = 0, landmarktime = 0,
alpha = .001, ncomp = 2, mixing = 1, pcrgenes = 10,
projectPath = projectPath,
outputName = outputName, generateHTML)
if (generateHTML)
browseURL(file.path(projectPath, "Output", outputName,
paste0(outputName, ".html")))
Note that data(Pomeroy)
loads a number of R objects as follows:
exprTrain
: a data frame of gene expression data for 59 training samples with 1914 genes. Rows are genes and columns are samples.
covTrain
: a data frame of a clinical covariate (i.e., sex) from 59 training samples. Rows are samples and column is the sex covariate.
exprTest
: a data frame of gene expression data for 1 new sample with 1914 genes.
covTest
: a data frame of a clinical covariate (i.e., sex) from 1 new sample. Rows are samples and column is the sex covariate.
geneId
: a data frame with 1914 gene IDs in rows and annotations (e.g., gene symbol) in columns.
tme
: a vector speciying survival time. It is ordered in the same order of training samples followed by new samples.
status
: a vector specifying survival status (1 = death, 0 = censored). It is ordered in the same order of training samples followed by new samples.
The resList2
list is the same as resList
shown in the Quick Start Section. For more details about survRiskPredict
, please type help("survRiskPredict")
in the R console.