This package creates a multivariate predictor for determining which of multiple classes a given sample belongs to. Several multivariate classification methods are available, including the Compound Covariate Predictor, Diagonal Linear Discriminant Analysis, Nearest Neighbor Predictor, Nearest Centroid Predictor, and Support Vector Machine Predictor. For every class prediction method requested, the package estimates how accurately the classes can be predicted by that multivariate class predictor. The whole procedure is evaluated by cross-validation; the available methods are leave-one-out cross-validation, k-fold validation, and 0.632+ bootstrap validation. The cross-validated estimate of the misclassification rate is computed, and the performance of each classifier is reported. New samples can then be classified using the specified classifiers and the multivariate predictor built from the full dataset.
This package provides test.classPredict for a quick start of class prediction analysis on one of the built-in sample datasets (i.e., “Brca”, “Perou”, and “Pomeroy”).
library(classpredict)
res <- test.classPredict("Brca", outputName = "ClassPrediction", generateHTML = TRUE)
names(res)
It outputs an HTML file (C:\Users\YourUserName\Documents\Brca\Output\ClassPrediction\ClassPrediction.html) with class prediction results, as well as a list res including the following objects:
##  [1] "performClass"        "percentCorrectClass" "predNewSamples"      "classifierTable"
##  [5] "probInClass"         "CCPSenSpec"          "LDASenSpec"          "K1NNSenSpec"
##  [9] "K3NNSenSpec"         "CentroidSenSpec"     "SVMSenSpec"          "BCPPSenSpec"
## [13] "probNew"             "weightLinearPred"    "thresholdLinearPred" "GRPCentroid"
## [17] "pmethod"             "workPath"
Here we give a simple explanation of each object in res:
res$performClass
is a data frame with the performance of classifiers during cross-validation:
##    Array id Class label Mean Number of genes in classifier CCP Correct? DLDA Correct? 1NN Correct?
##  1    s1996       BRCA1                                 16          YES           YES          YES
##  2    s1822       BRCA1                                 20          YES           YES          YES
##  3    s1714       BRCA1                                 28          YES           YES          YES
##  4    s1224       BRCA1                                 15          YES           YES          YES
##  5    s1252       BRCA1                                 28          YES            NO          YES
##  6    s1510       BRCA1                                 20          YES           YES          YES
##  7    s1905       BRCA1                                 20          YES           YES          YES
##  8    s1900       BRCA2                                 13          YES           YES          YES
##  9    s1787       BRCA2                                 17          YES           YES          YES
## 10    s1721       BRCA2                                 10          YES           YES          YES
## 11    s1486       BRCA2                                 17           NO            NO          YES
##    3NN Correct? Nearest Centroid Correct? SVM Correct? BCCP Correct?
##  1          YES                       YES          YES           YES
##  2          YES                       YES          YES           YES
##  3          YES                       YES          YES           YES
##  4          YES                       YES          YES           YES
##  5           NO                       YES          YES           YES
##  6          YES                       YES          YES           YES
##  7          YES                       YES          YES           YES
##  8           NO                        NO          YES           YES
##  9          YES                       YES          YES           YES
## 10          YES                       YES          YES           YES
## 11           NO                        NO           NO            NO
res$percentCorrectClass
is a data frame with the mean percentage of correct classifications for each prediction method:
##   CCP Correct? DLDA Correct? 1NN Correct? 3NN Correct? Nearest Centroid Correct? SVM Correct? BCCP Correct?
## 1           91            82          100           73                        82           91            91
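The percentages can be reproduced from the YES/NO calls in the res$performClass table. The following is a minimal sketch, assuming the percentage is simply the rounded share of YES entries per column; the data frame and its shortened column names are typed in by hand for illustration and are not objects returned by the package.

```r
# Cross-validation calls copied by hand from the res$performClass table
# (column names shortened for this sketch; four of the seven columns shown).
perform <- data.frame(
  ccp  = c("YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES", "YES", "NO"),
  dlda = c("YES", "YES", "YES", "YES", "NO",  "YES", "YES", "YES", "YES", "YES", "NO"),
  nn1  = rep("YES", 11),
  nn3  = c("YES", "YES", "YES", "YES", "NO",  "YES", "YES", "NO",  "YES", "YES", "NO")
)

# Share of correctly classified samples per method, as a rounded percentage
pct_correct <- round(100 * colMeans(perform == "YES"))
print(pct_correct)
```

With 11 training samples, one misclassification corresponds to 10/11, which rounds to 91 percent.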
res$predNewSamples
is a data frame with the predicted class for each new sample. NC means that a sample is not classified. In this example, there are four new samples.
##   ExpID TrueClass   CCP   LDA    K1    K3 Centroid   SVM  BCCP
## 1 s1816   predict BRCA2 BRCA2 BRCA2 BRCA2    BRCA2 BRCA2 BRCA2
## 2 s1616   predict BRCA2 BRCA1 BRCA2 BRCA1    BRCA2 BRCA2    NC
## 3 s1063   predict BRCA1 BRCA1 BRCA1 BRCA1    BRCA1 BRCA1 BRCA1
## 4 s1936   predict BRCA2 BRCA2 BRCA2 BRCA2    BRCA2 BRCA2 BRCA2
res$probNew
is a data frame with the predicted probability of each new sample belonging to the class (BRCA1) from the Bayesian Compound Covariate method.
##   Array id Class Probability
## 1    s1816 BRCA1  p < 1.0e-3
## 2    s1616 BRCA1       0.344
## 3    s1063 BRCA1           1
## 4    s1936 BRCA1  p < 1.0e-3
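These probabilities help explain the NC entry for sample s1616 in res$predNewSamples. The sketch below assumes a rule in which a class is called only when its probability reaches a threshold, and it assumes the value 0.8 that appears as bccThresh in the classPredict call later in this document. The function bcc_call is hypothetical and is not part of the package.

```r
# Assumed rule for this sketch: call a class only when its probability
# reaches the threshold; otherwise report "NC" (not classified).
bcc_call <- function(p_class1, threshold = 0.8,
                     labels = c("BRCA1", "BRCA2")) {
  p_class2 <- 1 - p_class1
  ifelse(p_class1 >= threshold, labels[1],
         ifelse(p_class2 >= threshold, labels[2], "NC"))
}

# Probabilities of BRCA1 for the four new samples; the two values printed
# as "p < 1.0e-3" are replaced by 0 here purely for illustration.
p_brca1 <- c(s1816 = 0, s1616 = 0.344, s1063 = 1, s1936 = 0)
calls <- bcc_call(p_brca1)
print(calls)
```

Under these assumptions s1616 has probability 0.344 for BRCA1 and 0.656 for BRCA2, so neither class reaches 0.8 and the sample stays unclassified.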
res$classifierTable
is a data frame with composition of classifiers such as geometric means of values in each class, p-values and Gene IDs.
res$probInClass
is a data frame with the predicted probability of each training sample belonging to a class during cross-validation from the Bayesian Compound Covariate method.
res$CCPSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Compound Covariate Predictor Classifier.
res$LDASenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Diagonal Linear Discriminant Analysis Classifier.
res$K1NNSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the 1-Nearest Neighbor Classifier.
res$K3NNSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the 3-Nearest Neighbor Classifier.
res$CentroidSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Nearest Centroid Classifier.
res$SVMSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Support Vector Machine Classifier.
res$BCPPSenSpec
is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Bayesian Compound Covariate Classifier.
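For readers unfamiliar with these four measures, the sketch below computes them from true and predicted labels using the standard definitions. The helper sens_spec is hypothetical and is not the package's implementation; the labels mimic the CCP cross-validation column (seven BRCA1 samples, four BRCA2 samples, one BRCA2 sample misclassified), with BRCA1 arbitrarily treated as the positive class.

```r
# Standard two-class performance measures for a chosen positive class
sens_spec <- function(truth, predicted, positive) {
  tp <- sum(truth == positive & predicted == positive)
  fn <- sum(truth == positive & predicted != positive)
  tn <- sum(truth != positive & predicted != positive)
  fp <- sum(truth != positive & predicted == positive)
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn))
}

truth <- c(rep("BRCA1", 7), rep("BRCA2", 4))
predicted <- truth
predicted[11] <- "BRCA1"  # the one BRCA2 sample that CCP got wrong

perf <- sens_spec(truth, predicted, positive = "BRCA1")
print(perf)
```

The values stored in the data frames above may be organized differently (for example, one row per class), so this is only a guide to interpreting them.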
res$weightLinearPred
is a data frame with gene weights for linear predictors such as Compound Covariate Predictor,
Diagonal Linear Discriminant Analysis and Support Vector Machine.
res$thresholdLinearPred
contains the thresholds for the linear prediction rules associated with res$weightLinearPred. Each prediction rule is defined by the weighted sum of the log expression values (\(x_i\)) of the significant genes, using the weights (\(w_i\)). In this case, a sample is classified to the class BRCA1 if the sum is greater than the threshold; that is, \(\sum_i w_i x_i > \text{threshold}\).
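As a small worked example of this rule, the sketch below applies it to made-up numbers. The gene names, weights, expression values and threshold are all hypothetical; in practice they would come from res$weightLinearPred, the sample's log expression values and res$thresholdLinearPred.

```r
# Hypothetical weights, log expression values and threshold
w <- c(geneA = 1.5, geneB = -2.0, geneC = 0.5)
x <- c(geneA = 0.8, geneB = -0.3, geneC = 1.2)
threshold <- 1.0

# Weighted sum over the significant genes: 1.2 + 0.6 + 0.6 = 2.4
score <- sum(w * x[names(w)])

# Classify to BRCA1 when the sum exceeds the threshold, otherwise BRCA2
predicted <- if (score > threshold) "BRCA1" else "BRCA2"
print(c(score = score))
print(predicted)
```

Indexing x by names(w) keeps genes and weights aligned even when the two vectors are stored in a different order.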
res$GRPCentroid
is a data frame with centroid of each class for each predictor gene.
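To show how class centroids can be used for prediction, here is a minimal nearest-centroid sketch. The centroid values and the new sample are hypothetical, and Euclidean distance is an assumption of this sketch rather than a statement about the package's distance metric.

```r
# Hypothetical centroids: rows are predictor genes, columns are classes
centroids <- cbind(BRCA1 = c(g1 = -1.0, g2 = 0.5, g3 = 0.2),
                   BRCA2 = c(g1 = 0.4, g2 = -0.6, g3 = 0.9))
new_sample <- c(g1 = -0.7, g2 = 0.3, g3 = 0.0)

# Euclidean distance from the sample to each class centroid
# (the sample vector is recycled down each column of the matrix)
d <- sqrt(colSums((centroids - new_sample)^2))

# The predicted class is the one with the closest centroid
predicted <- names(which.min(d))
print(d)
print(predicted)
```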
res$pmethod
is a vector of prediction methods that are specified.
res$workPath
is the path for Fortran and other intermediate outputs.
Cross-validation ROC curves are provided for Compound Covariate Predictor, Diagonal Linear Discriminant Analysis and Bayesian Compound Covariate Classifiers.
library(classpredict)
res <- test.classPredict("Brca", outputName = "ClassPrediction")
## Getting analysis results ...
plotROCCurve(res,"ccp")
plotROCCurve(res,"dlda")
plotROCCurve(res,"bcc")
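For intuition about what such a curve summarizes, the sketch below builds ROC points and a trapezoidal AUC by hand in base R. It is not the implementation behind plotROCCurve; the helper functions, scores and labels are all hypothetical.

```r
# True and false positive rates at each distinct score threshold
roc_points <- function(score, is_positive) {
  thr <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thr, function(t) mean(score[is_positive] >= t))
  fpr <- sapply(thr, function(t) mean(score[!is_positive] >= t))
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))  # start the curve at (0, 0)
}

# Area under the curve by the trapezoidal rule
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Hypothetical cross-validated scores and true class membership
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

roc <- roc_points(score, is_positive)
auc <- auc_trapezoid(roc)
print(auc)
```

Calling plot(roc$fpr, roc$tpr, type = "s") on the result would draw the stepwise curve.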
classPredict is the main R function to perform class prediction analysis. In this section, we look in detail at how to prepare inputs for classPredict. Once again, we use the “Brca” sample data as an example. The package contains the following “Brca” sample files:
* Brca_LOGRAT.txt: a table of expression data with rows representing genes and columns representing samples;
* Brca_FILTER.TXT: a list of filtering information, where 1 means the corresponding gene passes the filters while 0 means it is excluded from analysis;
* Brca_GENEID.txt: a table of gene information corresponding to the rows of Brca_LOGRAT.txt and Brca_FILTER.TXT;
* Brca_EXPDESIGN.txt: a table with class information and/or separate test set information.
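To make the 0/1 filter convention concrete, here is a minimal sketch on toy data. The matrix and filter vector are made up. Since classPredict accepts the filter vector as an argument, the subsetting below only illustrates what the flags mean and is not a step the user needs to perform.

```r
# Toy stand-ins for the expression table and the 0/1 filter vector
expr <- matrix(c(-3.08, 0.28, 0.44, -0.84,
                 -2.73, -0.20, 0.10, -0.24),
               nrow = 4,
               dimnames = list(paste0("gene", 1:4), c("s1", "s2")))
gene_filter <- c(1, 0, 1, 1)

# The filter must supply exactly one flag per gene row
stopifnot(length(gene_filter) == nrow(expr))

# Genes flagged 1 pass the filter; genes flagged 0 are excluded
expr_kept <- expr[gene_filter == 1, , drop = FALSE]
print(dim(expr_kept))
print(rownames(expr_kept))
```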
There are a total of 15 samples, of which 11 will be used as training data and the remaining 4 are new samples for class prediction. We run the following code to obtain objects such as exprTrain and exprTest as inputs to classPredict.
dataset<-"Brca"
# gene IDs
geneId <- read.delim(system.file("extdata", paste0(dataset, "_GENEID.txt"), package = "classpredict"), as.is = TRUE, colClasses = "character")
# expression data
x <- read.delim(system.file("extdata", paste0(dataset, "_LOGRAT.TXT"), package = "classpredict"), header = FALSE)
# filter information, 1 - pass the filter, 0 - filtered
filter <- scan(system.file("extdata", paste0(dataset, "_FILTER.TXT"), package = "classpredict"), quiet = TRUE)
# class information
expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"), package = "classpredict"), as.is = TRUE)
# training/test information
testSet <- expdesign[, 10]
trainingInd <- which(testSet == "training")
predictInd <- which(testSet == "predict")
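# map class positions back to columns of x, in case training samples are not listed first
ind1 <- trainingInd[which(expdesign[trainingInd, 4] == "BRCA1")]
ind2 <- trainingInd[which(expdesign[trainingInd, 4] == "BRCA2")]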
ind <- c(ind1, ind2)
exprTrain <- x[, ind]
colnames(exprTrain) <- expdesign[ind, 1]
exprTest <- x[, predictInd]
colnames(exprTest) <- expdesign[predictInd, 1]
exprTrain
is a 3226*11 matrix with rows representing genes and columns representing 11 training samples.
## s1996 s1822 s1714 s1224 s1252 s1510 s1905 s1900 s1787 s1721 s1486
## 1 -3.08179379 -2.730392933 -1.87446904 -2.28824496 -0.345387012 -1.42321134 -1.68280995 -1.77760768 -0.24100803 -0.29195589 0.241469175
## 2 0.27810180 -0.201139927 -0.53343225 -0.57929373 -0.287439704 -0.88264304 -1.00000000 -0.41503757 -1.02236784 -0.74802077 -1.166995645
## 3 0.43758011 0.104796171 0.95334989 -0.22050031 0.353232265 -0.67318958 0.99407518 0.51096189 -0.16438681 0.02185956 0.241469175
## 4 -0.83893764 -0.235628277 0.61951971 0.81221521 -0.418143421 -0.52509099 0.76970232 0.26303440 0.64296818 1.45843005 -0.041464776
## 5 -0.43409583 0.067563236 0.76553470 -0.09386685 -0.418143421 0.38414347 -0.27252591 -0.19264519 -0.51457310 -0.62403196 -0.017618060
## ......
exprTest
is a 3226*4 matrix with the expressions of four new samples.
## s1816 s1616 s1063 s1936
## 1 -0.8214026 -0.5618789 -0.4611339 -0.93288577
## 2 -0.8614801 -1.6322682 -0.7737241 -0.33342373
## 3 0.4066253 0.4381211 0.4116309 1.25153875
## 4 1.3286228 1.3737305 0.5574818 1.02272010
## 5 1.3330686 1.2422009 0.0402640 0.03394729
The following procedure develops seven classifiers from all training samples and uses them to predict the classes of the new samples. The individual genes used by the classifiers are selected at the 0.001 significance level, and the random variance model is used for the univariate tests. Leave-one-out cross-validation is used to evaluate class prediction accuracy: in each round, predictors are selected and classifiers are trained on the cross-validated training set, and the cross-validated estimate of misclassification error is calculated over the cross-validated test set. Equal prior probabilities are assumed for the Bayesian Compound Covariate Predictor.
projectPath <- tempdir()
outputName <- "classPredictionBrca"
generateHTML <- TRUE
prevalence <- c(length(ind1)/(length(ind1)+length(ind2)), length(ind2)/(length(ind1)+length(ind2)))
names(prevalence) <- c("BRCA1", "BRCA2")
resList <- classPredict(exprTrain = exprTrain, exprTest = exprTest, isPaired = FALSE,
                        pairVar.train = NULL, pairVar.test = NULL, geneId,
                        cls = c(rep("BRCA1", length(ind1)), rep("BRCA2", length(ind2))),
                        pmethod = c("ccp", "bcc", "dlda", "knn", "nc", "svm"),
                        geneSelect = "igenes.univAlpha",
                        univAlpha = 0.001, univMcr = 0, foldDiff = 0, rvm = TRUE, filter = filter,
                        ngenePairs = 25, nfrvm = 10, cvMethod = 1, kfoldValue = 10, bccPrior = 1,
                        bccThresh = 0.8, nperm = 0, svmCost = 1, svmWeight = 1, fixseed = 1,
                        prevalence = prevalence, projectPath = projectPath,
                        outputName = outputName, generateHTML = generateHTML)
# open the HTML report if one was generated
if (generateHTML) {
  browseURL(file.path(projectPath, "Output", outputName,
                      paste0(outputName, ".html")))
}
It returns the same list as shown in the Quick Start Section. For more details about classPredict, please type help("classPredict") in the R console.
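The leave-one-out evaluation described in this section depends on repeating gene selection inside every cross-validation round, so that the held-out sample never influences which genes are chosen. The sketch below illustrates that idea on simulated data. It is not the package's implementation: an ordinary two-sample t-test stands in for the random variance model, a simple nearest-centroid rule stands in for the seven classifiers, and the function loocv_predict and all data are hypothetical.

```r
set.seed(1)
n_genes <- 200
n_per_class <- 8
cls <- rep(c("A", "B"), each = n_per_class)

# Simulated log expression: the first 10 genes differ between the classes
expr <- matrix(rnorm(n_genes * length(cls)), nrow = n_genes)
expr[1:10, cls == "B"] <- expr[1:10, cls == "B"] + 2

loocv_predict <- function(expr, cls, alpha = 0.001) {
  pred <- character(ncol(expr))
  for (i in seq_len(ncol(expr))) {
    # Hold out sample i; everything below uses the training fold only
    train <- expr[, -i, drop = FALSE]
    train_cls <- cls[-i]

    # Univariate gene selection at significance level alpha
    pvals <- apply(train, 1, function(g) {
      t.test(g[train_cls == "A"], g[train_cls == "B"], var.equal = TRUE)$p.value
    })
    sel <- which(pvals < alpha)
    if (length(sel) == 0) sel <- which.min(pvals)  # fall back to the best gene

    # Nearest-centroid rule on the selected genes
    cen_a <- rowMeans(train[sel, train_cls == "A", drop = FALSE])
    cen_b <- rowMeans(train[sel, train_cls == "B", drop = FALSE])
    xi <- expr[sel, i]
    pred[i] <- if (sum((xi - cen_a)^2) <= sum((xi - cen_b)^2)) "A" else "B"
  }
  pred
}

pred <- loocv_predict(expr, cls)
error_rate <- mean(pred != cls)  # cross-validated misclassification rate
print(error_rate)
```

Selecting genes once on the full dataset and cross-validating only the classifier would give an optimistically biased error estimate, which is why the selection step sits inside the loop.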