gsea {GSEA} | R Documentation |
This function conducts gene set enrichment analysis among pre-defined classes and for survival data and quantitative trait data, respectively. It finds BioCarta pathways, KEGG pathways, transcription factor target lists or microRNA target lists with statistically significant differences among pre-defined classes. It aslo finds gene sets that are correlated with survival or quantitative trait of samples. A gene set is selected if its corresponding re-sampling p-value is below the specified threshold. The re-sampling p-value is calculated through permutation tests. Basically, 100,000 LS (log score) or KS (Kolmogorov-Smirnov) permutation tests are conducted to calculate a p-value measuring the gene set enrichment. For each gene set, N genes are randomly selected from a gene list in analysis, where N is the number of genes belonging to that gene set For each permutation, the LS or KS statistics associated with that gene set are computed based on the p-value. The p-value for that gene set is then defined as the proportion of permutations for which the LS (KS) statistics are larger than the observed LS (KS) statistics from original data.
gsea(expr, filter, surv = FALSE, time = NULL, status = NULL, quant = NULL, geneId, cls, isPaired = FALSE, pairID = NULL, doGroupComparison = FALSE, grpID = NULL, hasDuplic = FALSE, duplicID = NULL, rvm = TRUE, nperm.GSA = 200, geneSetType = c("BioCarta", "KEGG", "TF", "microRNA"), fromKEGGdb = TRUE, frommiRTarBasev4 = TRUE, pathwayMin = 5, pathwayMax = 1000, isSingleChannel = T, alpha = 0.05, seed = 123456, organism = c("human", "mouse"), corrtest.method = c("pearson", "spearman"), projectPath, outputName = "GeneSetClassComparison", popHTML = TRUE)
expr |
matrix of gene expression data for training samples. Rows are genes and columns are arrays. Its column names must be provided. |
filter |
vector of 1's or 0's of the same length as genes. 1 means to keep the gene
while 0 means to exclude genes
from the gene set enrichment analysis. If |
surv |
logical specifying if it is survival data or not. Default is |
time |
vector specifying survival time. Defualt is |
status |
vector specifying survival status. Defualt is |
quant |
vector specifying the quantitative trait of samples. Default is |
geneId |
matrix/data frame of gene IDs. Rows are IDs and columns are annotations
such as Symbol and EntrezID. Its row names must be provided as IDs, and one of its
column names must represent gene symbols. The rows must be in the same order as
those of IDs in |
cls |
vector of sample classes. |
isPaired |
logical. If |
pairID |
vector of pairing variables for all samples. |
doGroupComparison |
logical. If |
grpID |
vector specifying the group ID. Default is |
hasDuplic |
logical. If |
duplicID |
vector specifying array replicates. Default is |
rvm |
logical. If |
nperm.GSA |
numeric specifying the number of permutation tests in |
geneSetType |
vector specifying the type of gene sets for enrichment analysis. It can be "BioCarta" , "KEGG", "TF" and "microRNA", representing BioCarta pathways, KEGG pathways, experimentally verified transcription factor targets and experimentally verified microRNA targets, respectively. |
fromKEGGdb |
logical specifying if KEGG pathways are obtained from the KEGG.db or KEGGREST R package.
When |
frommiRTarBasev4 |
logical specifying if the microRNA target lists are obtained from the miRTarBase database v4 or v7
at http://mirtarbase.mbc.nctu.edu.tw/php/download.php. When |
pathwayMin |
the minimal number of genes being allowed in one pathway. Default is 5. |
pathwayMax |
the maximal number of genes being allowed in one pathway. Default is 2000. |
isSingleChannel |
logical. If |
alpha |
numeric specifying the significant level being set for significantly enriched pathways. Default is 0.05. |
seed |
numeric specifying random seed for permutation tests. Default is 123456. |
organism |
character specifying the organism. It can be "human" or "mouse". When |
corrtest.method |
charcater specifying the Pearson or Spearman correlation test to find enriched gene sets given the quantitative trait data. It can be "pearson" or "spearman". |
projectPath |
character specifying the ouput directory. |
outputName |
character specifying the directory for keeping the HTML document and associated results. Default is "GeneSetClassComparison". |
popHTML |
logical. If TRUE, an HTML document with significantly enriched pathways will be popped up. The file will
be saved as <projectPath>/Output/<outputName>/<outputName>.html.
Default is |
a data frame containing the enriched BioCarta pathways, KEGG pathways, transcription factor target lists or microRNA target lists, pathway description, number of genes in each enriched pathway, p-values for LS/KS permutation tests and Efron-Tibshirani's GSA tests.
BRB-ArrayTools Development Team, arraytools@emmes.com
Xu X, Zhao Y and Simon R. Gene Set Expression Comparison kit for BRB-ArrayTools. Bioinformatics 2008. 24: 137-9.
BRB-ArrayTools manual: https://brb.nci.nih.gov/BRB-ArrayTools/Documentation.html.
## find BioCarta significant pathways among two classes in a breast cancer dataset dataset<-"Brca" # gene IDs geneId <- read.delim(system.file("extdata", paste0(dataset, "_GENEID.txt"), package = "GSEA"), as.is = TRUE, colClasses = "character") # gene expression x <- read.delim(system.file("extdata", paste0(dataset, "_LOGRAT.TXT"), package = "GSEA"), header = FALSE) filter <- scan(system.file("extdata", paste0(dataset, "_FILTER.TXT"), package = "GSEA"), quiet = TRUE) # sample information expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"), package = "GSEA"), as.is = TRUE) ind1 <- which(expdesign[, 4] == "BRCA1") ind2 <- which(expdesign[, 4] == "BRCA2") ind <- c(ind1, ind2) expr <- x[, ind] colnames(expr) <- expdesign[ind, 1] projectPath <- file.path(Sys.getenv("HOME"),"Brca") outputName <- "GeneSetClassComparison" cls <- c(rep("BRCA1", length(ind1)), rep("BRCA2", length(ind2))) gsea(expr = expr, filter = filter, geneId = geneId, cls = cls, geneSetType = "BioCarta", isSingleChannel = FALSE, alpha = 0.005, organism = "human", projectPath = projectPath, outputName = outputName) ## find BioCarta pathways that are significantly correlated with survival dataset<-"Pomeroy" # gene IDs geneId <- read.delim(system.file("extdata", paste0(dataset, "_GENEID.txt"), package = "GSEA"), as.is = TRUE, colClasses = "character") # expression data x <- read.delim(system.file("extdata", paste0(dataset, "_LOGINT.TXT"), package = "GSEA"), header = FALSE) # filter information, 1 - pass the filter, 0 - filtered filter <- scan(system.file("extdata", paste0(dataset, "_FILTER.TXT"), package = "GSEA"), quiet = TRUE) # sample information expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"), package = "GSEA"), as.is = TRUE) time <- expdesign[,7] status <- expdesign[,12] ind1 <- which(status == 0) ind2 <- which(status == 1) ind <- c(ind1, ind2) expr <- x[, ind] time <- time[ind] status <- status[ind] colnames(expr) <- expdesign[ind, 1] projectPath <- file.path(Sys.getenv("HOME"),dataset) outputName <- "GeneSetSurvivalComparison" gsea(expr = expr, filter = filter, surv = T, time = time, status = status, geneId = geneId, rvm = FALSE, geneSetType = "BioCarta", alpha = 0.005, organism = "human", projectPath = projectPath, outputName = outputName) ## find KEGG pathways that are significally correlated with simulated ## quantitative trait of samples dataset<-"Brca" # gene IDs geneId <- read.delim(system.file("extdata", paste0(dataset, "_GENEID.txt"), package = "GSEA"), as.is = TRUE, colClasses = "character") # expression data expr <- read.delim(system.file("extdata", paste0(dataset, "_LOGRAT.TXT"), package = "GSEA"), header = FALSE) # filter information, 1 - pass the filter, 0 - filtered filter <- scan(system.file("extdata", paste0(dataset, "_FILTER.TXT"), package = "GSEA"), quiet = TRUE) # sample information expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"), package = "GSEA"), as.is = TRUE) quant <- expdesign[,11] colnames(expr) <- expdesign[, 1] projectPath <- file.path(Sys.getenv("HOME"),"Brca") outputName <- "GeneSetQTComparison" gsea(expr = expr, filter = filter, quant = quant, geneId = geneId, geneSetType = "KEGG", isSingleChannel = FALSE, alpha = 0.005, organism = "human", corrtest.method = "pearson", projectPath = projectPath, outputName = outputName) ## find significant TF target gene lists among two classes in a breast cancer dataset dataset<-"Brca" # gene IDs geneId <- read.delim(system.file("extdata", paste0(dataset, "_GENEID.txt"), package = "GSEA"), as.is = TRUE, colClasses = "character") # gene expression x <- read.delim(system.file("extdata", paste0(dataset, "_LOGRAT.TXT"), package = "GSEA"), header = FALSE) filter <- scan(system.file("extdata", paste0(dataset, "_FILTER.TXT"), package = "GSEA"), quiet = TRUE) # sample information expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"), package = "GSEA"), as.is = TRUE) ind1 <- which(expdesign[, 4] == "BRCA1") ind2 <- which(expdesign[, 4] == "BRCA2") ind <- c(ind1, ind2) expr <- x[, ind] colnames(expr) <- expdesign[ind, 1] projectPath <- file.path(Sys.getenv("HOME"),"Brca") outputName <- "GeneSetClassComparison" cls <- c(rep("BRCA1", length(ind1)), rep("BRCA2", length(ind2))) gsea(expr = expr, filter = filter, geneId = geneId, cls = cls, geneSetType = "TF", isSingleChannel = FALSE, alpha = 0.005, organism = "human", projectPath = projectPath, outputName = outputName) ## find significant microRNA target gene lists among two classes ## in a breast cancer dataset dataset<-"Brca" # gene IDs geneId <- read.delim(system.file("extdata", paste0(dataset, "_GENEID.txt"), package = "GSEA"), as.is = TRUE, colClasses = "character") # gene expression x <- read.delim(system.file("extdata", paste0(dataset, "_LOGRAT.TXT"), package = "GSEA"), header = FALSE) filter <- scan(system.file("extdata", paste0(dataset, "_FILTER.TXT"), package = "GSEA"), quiet = TRUE) # sample information expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"), package = "GSEA"), as.is = TRUE) ind1 <- which(expdesign[, 4] == "BRCA1") ind2 <- which(expdesign[, 4] == "BRCA2") ind <- c(ind1, ind2) expr <- x[, ind] colnames(expr) <- expdesign[ind, 1] projectPath <- file.path(Sys.getenv("HOME"),"Brca") outputName <- "GeneSetClassComparison" cls <- c(rep("BRCA1", length(ind1)), rep("BRCA2", length(ind2))) gsea(expr = expr, filter = filter, geneId = geneId, cls = cls, geneSetType = "microRNA", isSingleChannel = FALSE, alpha = 0.005, organism = "human", projectPath = projectPath, outputName = outputName)