This package creates a multivariate predictor for determining to which of multiple classes a given sample belongs. Several multivariate classification methods are available, including the Compound Covariate Predictor, Diagonal Linear Discriminant Analysis, Nearest Neighbor Predictor, Nearest Centroid Predictor, and Support Vector Machine Predictor. For every class prediction method requested, the package estimates how accurately the classes can be predicted by that multivariate class predictor. The whole procedure is evaluated by cross-validation: leave-one-out cross-validation, k-fold cross-validation, or 0.632+ bootstrap validation. The cross-validated estimate of the misclassification rate is computed and the performance of each classifier is reported. New samples can then be classified with the specified classifiers, using the multivariate predictor built from the full dataset.
To install the package from its binary version, you need to manually pre-install the ROC dependency package by running the following script in R console:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ROC")
Afterwards, install the classpredict R package from a local file. Click on “Packages” on the R menu bar and select “install package(s) from local files”. Then browse for “classpredict_0.2.zip” and click on “open”.
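The same local installation can probably be done from the R console as well. The sketch below assumes the binary file sits in the current working directory (the path is a placeholder, not something the package prescribes) and relies on the standard `install.packages` arguments for a local Windows binary.

```r
# Path to the downloaded binary; adjust to where the file was saved (placeholder).
pkgFile <- "classpredict_0.2.zip"

if (file.exists(pkgFile)) {
  # repos = NULL tells R to install from a local file rather than a repository.
  install.packages(pkgFile, repos = NULL, type = "win.binary")
} else {
  message("File not found: ", pkgFile)
}
```

The `file.exists` guard only avoids a confusing error when the path is wrong; it is not required.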
This package provides test.classPredict for a quick start of class prediction analysis on one of the built-in sample datasets (i.e., “Brca”, “Perou”, and “Pomeroy”).
library(classpredict)
res <- test.classPredict("Brca",outputName = "ClassPrediction", generateHTML = FALSE)
## Getting analysis results ...
The list res includes the following objects:
names(res)
## [1] "performClass" "percentCorrectClass" "predNewSamples"
## [4] "classifierTable" "probInClass" "CCPSenSpec"
## [7] "LDASenSpec" "K1NNSenSpec" "K3NNSenSpec"
## [10] "CentroidSenSpec" "SVMSenSpec" "BCPPSenSpec"
## [13] "probNew" "weightLinearPred" "thresholdLinearPred"
## [16] "GRPCentroid" "pmethod" "workPath"
Here we give a brief explanation of each object in res:
res$performClass
is a data frame with the performance of classifiers during cross-validation:
res$performClass[1:11,]
## Array id Class label Mean Number of genes in classifier CCP Correct?
## 1 s1996 BRCA1 16 YES
## 2 s1822 BRCA1 20 YES
## 3 s1714 BRCA1 28 YES
## 4 s1224 BRCA1 15 YES
## 5 s1252 BRCA1 28 YES
## 6 s1510 BRCA1 20 YES
## 7 s1905 BRCA1 20 YES
## 8 s1900 BRCA2 13 YES
## 9 s1787 BRCA2 17 YES
## 10 s1721 BRCA2 10 YES
## 11 s1486 BRCA2 17 NO
## DLDA Correct? 1NN Correct? 3NN Correct? Nearest Centroid Correct?
## 1 YES YES YES YES
## 2 YES YES YES YES
## 3 YES YES YES YES
## 4 YES YES YES YES
## 5 NO YES NO YES
## 6 YES YES YES YES
## 7 YES YES YES YES
## 8 YES YES NO NO
## 9 YES YES YES YES
## 10 YES YES YES YES
## 11 NO YES NO NO
## SVM Correct? BCCP Correct?
## 1 YES YES
## 2 YES YES
## 3 YES YES
## 4 YES YES
## 5 YES YES
## 6 YES YES
## 7 YES YES
## 8 YES YES
## 9 YES YES
## 10 YES YES
## 11 NO NO
res$percentCorrectClass
is a data frame with the mean percentage of correctly classified samples for each prediction method.
res$percentCorrectClass
## CCP Correct? DLDA Correct? 1NN Correct? 3NN Correct?
## 1 91 82 100 73
## Nearest Centroid Correct? SVM Correct? BCCP Correct?
## 1 82 91 91
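The percentages above appear to follow directly from the per-sample "Correct?" columns of res$performClass. The sketch below re-enters three of those columns by hand from the 11 rows printed earlier and recomputes the percentages in base R. The column names are simplified for illustration and are not the package's own.

```r
# Hand-copied "Correct?" values for three classifiers (11 training samples).
# Column names are simplified for this sketch; they are not the package's names.
perf <- data.frame(
  CCP  = c(rep("YES", 10), "NO"),
  DLDA = c("YES", "YES", "YES", "YES", "NO", "YES", "YES", "YES", "YES", "YES", "NO"),
  NN3  = c("YES", "YES", "YES", "YES", "NO", "YES", "YES", "NO", "YES", "YES", "NO"),
  stringsAsFactors = FALSE
)

# Percentage of "YES" entries per classifier, rounded to the nearest integer.
pctCorrect <- round(100 * colMeans(perf == "YES"))
print(pctCorrect)
```

This gives 91, 82 and 73 for CCP, DLDA and 3NN, which agrees with the values reported in res$percentCorrectClass.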
res$predNewSamples
is a data frame with the predicted class for each new sample. NC means that a sample is not classified. In this example, there are four new samples.
res$predNewSamples[1:4,]
## ExpID TrueClass CCP LDA K1 K3 Centroid SVM BCCP
## 1 s1816 predict BRCA2 BRCA2 BRCA2 BRCA2 BRCA2 BRCA2 BRCA2
## 2 s1616 predict BRCA2 BRCA1 BRCA2 BRCA1 BRCA2 BRCA2 NC
## 3 s1063 predict BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1 BRCA1
## 4 s1936 predict BRCA2 BRCA2 BRCA2 BRCA2 BRCA2 BRCA2 BRCA2
res$probNew
is a data frame with the predicted probability of each new sample belonging to the class (BRCA1) from the Bayesian Compound Covariate method.
res$probNew[1:4,]
## Array id Class Probability
## 1 s1816 BRCA1 p < 1.0e-3
## 2 s1616 BRCA1 0.344
## 3 s1063 BRCA1 1
## 4 s1936 BRCA1 p < 1.0e-3
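One plausible reading of the NC entry for s1616 in res$predNewSamples is that the Bayesian Compound Covariate predictor assigns a class only when the probability of one class reaches a threshold. The sketch below is an assumption, not the package's documented rule: it uses 0.8 (the bccThresh value passed to classPredict elsewhere in this document), a non-strict comparison, and a hypothetical helper `callClass`.

```r
# Probabilities of BRCA1 for the four new samples shown above.
# "p < 1.0e-3" is represented by 0.001 here, purely as a stand-in.
pBRCA1 <- c(s1816 = 0.001, s1616 = 0.344, s1063 = 1, s1936 = 0.001)

# Assumed decision rule (hypothetical helper, not part of classpredict):
# call a class only if its probability is at least `thresh`, otherwise "NC".
callClass <- function(p, thresh = 0.8) {
  ifelse(p >= thresh, "BRCA1",
         ifelse(1 - p >= thresh, "BRCA2", "NC"))
}

calls <- callClass(pBRCA1)
print(calls)
```

Under these assumptions the calls are BRCA2, NC, BRCA1, BRCA2, which matches the BCCP column of res$predNewSamples.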
res$classifierTable
is a data frame with the composition of the classifiers, such as geometric means of values in each class, p-values and gene IDs.
res$probInClass
is a data frame with the predicted probability of each training sample belonging to a class during cross-validation from the Bayesian Compound Covariate method.
res$CCPSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Compound Covariate Predictor Classifier.
res$LDASenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Diagonal Linear Discriminant Analysis Classifier.
res$K1NNSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the 1-Nearest Neighbor Classifier.
res$K3NNSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the 3-Nearest Neighbor Classifier.
res$CentroidSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Nearest Centroid Classifier.
res$SVMSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Support Vector Machine Classifier.
res$BCPPSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Bayesian Compound Covariate Classifier.
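For readers unfamiliar with these four measures, the sketch below computes them in base R from true and predicted labels, treating one class as the "positive" class. The labels are made up for illustration and `classMetrics` is a hypothetical helper, not a function of the package.

```r
# Hypothetical true and predicted labels for 11 samples (not package output).
truth <- c(rep("BRCA1", 7), rep("BRCA2", 4))
pred  <- c(rep("BRCA1", 6), "BRCA2", rep("BRCA2", 3), "BRCA1")

# Treat one class as "positive" and compute the four measures for it.
classMetrics <- function(truth, pred, positive) {
  tp <- sum(truth == positive & pred == positive)  # true positives
  fn <- sum(truth == positive & pred != positive)  # false negatives
  fp <- sum(truth != positive & pred == positive)  # false positives
  tn <- sum(truth != positive & pred != positive)  # true negatives
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    PPV = tp / (tp + fp),
    NPV = tn / (tn + fn))
}

m <- classMetrics(truth, pred, positive = "BRCA1")
print(round(m, 3))
```

With these toy labels there are 6 true positives, 1 false negative, 1 false positive and 3 true negatives, so sensitivity and PPV are 6/7 and specificity and NPV are 3/4.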
res$weightLinearPred
is a data frame with gene weights for linear predictors such as Compound Covariate Predictor, Diagonal Linear Discriminant Analysis and Support Vector Machine.
res$thresholdLinearPred
contains the thresholds for the linear prediction rules associated with res$weightLinearPred. Each prediction rule is defined by the weighted sum of the log expression values (\(x_i\)) of the significant genes, with weights \(w_i\). In this case, a sample is classified to the class BRCA1 if the sum is greater than the threshold; that is, \(\sum_i w_i x_i > \text{threshold}\).
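As a minimal sketch of such a rule, the code below applies hypothetical weights and a hypothetical threshold to two made-up samples. None of the numbers come from the package, and assigning the remaining samples to BRCA2 is an assumption for the two-class case.

```r
# Hypothetical weights and threshold for three genes (illustration only).
w <- c(geneA = 1.5, geneB = -2.0, geneC = 0.5)
threshold <- 0.25

# Log expression values of two hypothetical samples (genes in rows).
x <- cbind(sample1 = c(1.0, -0.5, 0.2),
           sample2 = c(-0.3, 0.8, 0.1))

# Weighted sum for each sample: sum_i w_i * x_i
score <- as.vector(crossprod(w, x))
names(score) <- colnames(x)

# Class BRCA1 if the sum exceeds the threshold; otherwise BRCA2 (assumed).
predicted <- ifelse(score > threshold, "BRCA1", "BRCA2")
print(score)
print(predicted)
```

Here the sums are 2.6 and -2.0, so sample1 is assigned to BRCA1 and sample2 to BRCA2.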
res$GRPCentroid
is a data frame with centroid of each class for each predictor gene.
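To illustrate how class centroids can be used for prediction, the sketch below computes per-class gene means from made-up training data and assigns a new sample to the closest centroid. The use of Euclidean distance and the helper `nearestCentroid` are assumptions for this example; the package's own distance measure may differ.

```r
# Hypothetical log expression values: 3 predictor genes x 6 training samples.
train <- cbind(c(2.0, 0.1, -1.0), c(1.8, 0.3, -1.2), c(2.2, -0.1, -0.8),
               c(-1.0, 0.2, 1.5), c(-1.2, 0.0, 1.7), c(-0.8, 0.4, 1.3))
cls <- c("BRCA1", "BRCA1", "BRCA1", "BRCA2", "BRCA2", "BRCA2")

# Centroid of each class: mean of each gene across the samples of that class.
centroids <- sapply(split(seq_along(cls), cls),
                    function(idx) rowMeans(train[, idx, drop = FALSE]))

# Assign a new sample to the class whose centroid is closest
# (squared Euclidean distance; an assumption for this sketch).
nearestCentroid <- function(x, centroids) {
  d <- colSums((centroids - x)^2)
  names(which.min(d))
}

newSample <- c(1.5, 0.2, -0.5)
print(centroids)
print(nearestCentroid(newSample, centroids))
```

The result of `sapply` is a genes-by-classes matrix, which mirrors the layout described for res$GRPCentroid (one centroid value per class for each predictor gene).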
res$pmethod
is a vector of prediction methods that are specified.
res$workPath
is the path for Fortran and other intermediate outputs.
Cross-validation ROC curves are provided for Compound Covariate Predictor, Diagonal Linear Discriminant Analysis and Bayesian Compound Covariate Classifiers.
library(classpredict)
res <- test.classPredict("Brca",outputName = "ClassPrediction")
## Getting analysis results ...
plotROCCurve(res,"ccp")
plotROCCurve(res,"dlda")
plotROCCurve(res,"bcc")
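For intuition about what such a curve summarizes, the sketch below computes ROC points and the area under the curve in base R from made-up scores and labels. It does not use plotROCCurve or the ROC package, and the data are hypothetical.

```r
# Hypothetical cross-validated scores and true labels (1 = positive class).
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
truth <- c(1,   1,   0,   1,   1,    0,   0,   0)

# Sensitivity and 1 - specificity at every observed score cut-off.
cuts <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(cuts, function(ct) mean(score[truth == 1] >= ct))
fpr <- sapply(cuts, function(ct) mean(score[truth == 0] >= ct))
rocPoints <- data.frame(cutoff = cuts, sensitivity = tpr, fpr = fpr)

# Area under the curve via the rank (Mann-Whitney) formula.
r <- rank(score)
nPos <- sum(truth == 1)
nNeg <- sum(truth == 0)
auc <- (sum(r[truth == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)

print(rocPoints)
print(auc)
```

For these toy values the area under the curve is 0.875. Calling `plot(c(0, fpr), c(0, tpr), type = "s")` would draw the corresponding step curve.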
When the argument generateHTML is set to TRUE, an HTML file called ClassPrediction.html will be created under C:\Users\YourUserName\Documents\Brca\Output\ClassPrediction.
classPredict is the main R function to perform class prediction analysis. In this section, we look in detail at how to prepare inputs for classPredict. Once again, we use the “Brca” sample data as an example. The package contains the following “Brca” sample information:
* Brca_LOGRAT.txt: a table of expression data with rows representing genes and columns representing samples;
* Brca_FILTER.TXT: a list of filtering information, where 1 means the corresponding gene passes the filters while 0 means it is excluded from analysis;
* Brca_GENEID.txt: a table of gene information corresponding to row information of Brca_LOGRAT.txt and Brca_FILTER.TXT;
* Brca_EXPDESIGN.txt: a table with class information and/or separate test set information.
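To make the meaning of the filter flags concrete, the sketch below applies a 0/1 filter vector to a small made-up expression table. It only illustrates the convention; the data are hypothetical and the sketch does not describe how the package applies the filter internally.

```r
# Hypothetical expression table: 5 genes (rows) x 3 samples (columns).
expr <- matrix(c( 0.2, -1.1,  0.5,
                  1.3,  0.4, -0.7,
                 -0.6,  0.9,  0.1,
                  0.0,  0.3,  0.8,
                  1.1, -0.2, -0.4),
               nrow = 5, byrow = TRUE,
               dimnames = list(paste0("gene", 1:5), c("s1", "s2", "s3")))

# One filter flag per gene: 1 = passes the filters, 0 = excluded.
filter <- c(1, 0, 1, 1, 0)

# The flags must line up with the rows of the expression table.
stopifnot(length(filter) == nrow(expr))

kept <- expr[filter == 1, , drop = FALSE]
print(rownames(kept))
```

In this toy case gene1, gene3 and gene4 are retained. The length check reflects the requirement that the filter list and the gene table describe the same rows as the expression table.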
There are a total of 15 samples, of which 11 will be used as training data and the remaining four are new samples for class prediction. We run the following code to obtain objects such as exprTrain and exprTest as inputs to classPredict.
dataset<-"Brca"
# gene IDs
geneId <- read.delim(system.file("extdata", paste0(dataset, "_GENEID.txt"), package = "classpredict"), as.is = TRUE, colClasses = "character")
# expression data
x <- read.delim(system.file("extdata", paste0(dataset, "_LOGRAT.TXT"), package = "classpredict"), header = FALSE)
# filter information, 1 - pass the filter, 0 - filtered
filter <- scan(system.file("extdata", paste0(dataset, "_FILTER.TXT"), package = "classpredict"), quiet = TRUE)
# class information
expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"), package = "classpredict"), as.is = TRUE)
# training/test information (column 10 of the design table)
testSet <- expdesign[, 10]
trainingInd <- which(testSet == "training")
predictInd <- which(testSet == "predict")
# class labels are in column 4; map matches back to row positions in the full design table
ind1 <- trainingInd[which(expdesign[trainingInd, 4] == "BRCA1")]
ind2 <- trainingInd[which(expdesign[trainingInd, 4] == "BRCA2")]
ind <- c(ind1, ind2)
exprTrain <- x[, ind]
colnames(exprTrain) <- expdesign[ind, 1]
exprTest <- x[, predictInd]
colnames(exprTest) <- expdesign[predictInd, 1]
exprTrain
is a 3226 x 11 matrix with rows representing genes and columns representing the 11 training samples.
exprTrain[1:5,]
## s1996 s1822 s1714 s1224 s1252 s1510
## 1 -3.0817938 -2.73039293 -1.8744690 -2.28824496 -0.3453870 -1.4232113
## 2 0.2781018 -0.20113993 -0.5334322 -0.57929373 -0.2874397 -0.8826430
## 3 0.4375801 0.10479617 0.9533499 -0.22050031 0.3532323 -0.6731896
## 4 -0.8389376 -0.23562828 0.6195197 0.81221521 -0.4181434 -0.5250910
## 5 -0.4340958 0.06756324 0.7655347 -0.09386685 -0.4181434 0.3841435
## s1905 s1900 s1787 s1721 s1486
## 1 -1.6828099 -1.7776077 -0.2410080 -0.29195589 0.24146917
## 2 -1.0000000 -0.4150376 -1.0223678 -0.74802077 -1.16699564
## 3 0.9940752 0.5109619 -0.1643868 0.02185956 0.24146917
## 4 0.7697023 0.2630344 0.6429682 1.45843005 -0.04146478
## 5 -0.2725259 -0.1926452 -0.5145731 -0.62403196 -0.01761806
exprTest
is a 3226 x 4 matrix with the expression values of the four new samples.
exprTest[1:5,]
## s1816 s1616 s1063 s1936
## 1 -0.8214026 -0.5618789 -0.4611339 -0.93288577
## 2 -0.8614801 -1.6322682 -0.7737241 -0.33342373
## 3 0.4066253 0.4381211 0.4116309 1.25153875
## 4 1.3286228 1.3737305 0.5574818 1.02272010
## 5 1.3330686 1.2422009 0.0402640 0.03394729
The following procedure develops seven classifiers from all training samples, which are then used to predict the classes of new samples. Individual genes used by the classifiers are selected at the 0.001 significance level. The random variance model is used for the univariate tests. Leave-one-out cross-validation is employed to evaluate class prediction accuracy: predictors are selected and classifiers are trained on the cross-validated training set, and the cross-validated estimate of misclassification error is calculated over the cross-validated test set. Equal prior probabilities are assumed for the Bayesian Compound Covariate Predictor.
projectPath <- file.path(Sys.getenv("HOME"),"Brca")
outputName <- "classPrediction2"
generateHTML <- TRUE
prevalence <- c(length(ind1)/(length(ind1)+length(ind2)),length(ind2)/(length(ind1)+length(ind2)))
names(prevalence) <- c("BRCA1", "BRCA2")
resList <- classPredict(exprTrain = exprTrain, exprTest = exprTest, isPaired = FALSE,
                        pairVar.train = NULL, pairVar.test = NULL, geneId,
                        cls = c(rep("BRCA1", length(ind1)), rep("BRCA2", length(ind2))),
                        pmethod = c("ccp", "bcc", "dlda", "knn", "nc", "svm"),
                        geneSelect = "igenes.univAlpha",
                        univAlpha = 0.001, univMcr = 0, foldDiff = 0, rvm = TRUE, filter = filter,
                        ngenePairs = 25, nfrvm = 10, cvMethod = 1, kfoldValue = 10, bccPrior = 1,
                        bccThresh = 0.8, nperm = 0, svmCost = 1, svmWeight = 1, fixseed = 1,
                        prevalence = prevalence, projectPath = projectPath,
                        outputName = outputName, generateHTML = generateHTML)
if (generateHTML)
  browseURL(file.path(projectPath, "Output", outputName,
                      paste0(outputName, ".html")))
It returns the same list as shown in the Quick Start Section. For more details about classPredict, please type help("classPredict") in the R console.
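The key idea of the cross-validation described in this section, namely that gene selection is repeated inside every fold rather than done once on all samples, can be sketched in base R. The code below is not the package's implementation: it uses simulated data, an ordinary two-sample t-test instead of the random variance model, and a simple nearest-centroid rule as a stand-in classifier. All names (`selectGenes`, `predictOne`) are hypothetical.

```r
set.seed(1)

# Simulated data: 200 genes x 12 samples, two classes of 6 (not package data).
nGenes <- 200
cls <- rep(c("A", "B"), each = 6)
expr <- matrix(rnorm(nGenes * length(cls)), nrow = nGenes)
# Make the first 10 genes truly different between the classes.
expr[1:10, cls == "B"] <- expr[1:10, cls == "B"] + 4

# Select genes by a two-sample t-test at level alpha, using training data only.
selectGenes <- function(x, y, alpha = 0.001) {
  p <- apply(x, 1, function(g) t.test(g[y == "A"], g[y == "B"])$p.value)
  which(p < alpha)
}

# Nearest-centroid prediction for one held-out sample.
predictOne <- function(xTrain, yTrain, xNew) {
  cA <- rowMeans(xTrain[, yTrain == "A", drop = FALSE])
  cB <- rowMeans(xTrain[, yTrain == "B", drop = FALSE])
  if (sum((xNew - cA)^2) <= sum((xNew - cB)^2)) "A" else "B"
}

pred <- character(length(cls))
for (i in seq_along(cls)) {
  xTrain <- expr[, -i, drop = FALSE]
  yTrain <- cls[-i]
  # Gene selection is repeated inside every fold, without the held-out sample.
  genes <- selectGenes(xTrain, yTrain)
  if (length(genes) == 0) {
    pred[i] <- NA  # no gene passed the threshold in this fold
  } else {
    pred[i] <- predictOne(xTrain[genes, , drop = FALSE], yTrain, expr[genes, i])
  }
}

# Cross-validated misclassification rate over the held-out samples.
cvError <- mean(pred != cls, na.rm = TRUE)
print(cvError)
```

Selecting genes on all samples before cross-validation would let the held-out sample influence the predictor and tends to give optimistic error estimates, which is why the selection step sits inside the loop.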
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] classpredict_0.2
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.0 magrittr_1.5 htmltools_0.3.6 tools_3.6.0
## [5] yaml_2.2.0 Rcpp_1.0.1 stringi_1.4.3 rmarkdown_1.13
## [9] knitr_1.23 stringr_1.4.0 digest_0.6.19 xfun_0.8
## [13] ROC_1.60.0 evaluate_0.14