1 Introduction

The GSEA R package conducts gene set enrichment analysis among pre-defined classes and for survival data and quantitative trait data. It finds BioCarta pathways, KEGG pathways, experimentally verified transcription factor target lists or experimentally verified microRNA target lists with statistically significant differences among pre-defined classes. It aslo finds gene sets that are significantly correlated with survival or quantitative trait of samples.

2 Installation

2.1 From the binary version

To install the package from its binary version, you need to manually pre-install several dependency packages (i.e., Biobase, GSA, bitops, Cairo) by running the following script in R console:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biobase")
install.packages(c("GSA","bitops","Cairo"), repos = "https://cran.rstudio.com")

Afterwards, please install the GSEA R package through the local installation in the R console. Click on “Packages” on the R menu bar, and select “install package(s) from local files”. Please browse for “GSEA_0.1.0.zip” and click on “open”.

2.2 From the source version

To install it from source, please first install the Biobase, GSA, bitops and Cairo R packages, and then do the following in the R console after setting the working directory to the one where GSEA_0.1.0.tar.gz is located:

if(!requireNamespace("remotes", quietly = TRUE))
  install.packages("remotes", repos = "https://cran.rstudio.com")
remotes::install_local("GSEA_0.1.0.tar.gz", dependencies = TRUE)

3 Gene sets

3.1 Biocarta and KEGG pathways

“BioCarta” is a trademark of BioCarta, Incorporated, and the pathways included in this package provide displays of gene interactions within pathways for human and mouse cellular processes. “KEGG” (Kyoto Encyclopedia of Genes and Genomes) is an original database product developed by Kinoru Kanehisa and associates, and the built-in pathways provide displays of interactions between enzymes and intermediary compounds within metabolic and regulatory pathways. The BioCarta and KEGG pathway gene lists used in this package are obtained from the Cancer Genome Anatomy Project and by using KEGGREST R package, respectively. There are 314 Biocarta pathways for human, 278 Biocarta pathways for mouse, 333 KEGG pathways from the KEGGREST R package and 229 KEGG pathways from the KEGG.db R package being built-in in the GSEA package.

# KEGG pathways from the KEGG.db R package
list.files(system.file("Pathways/KEGG", package = "GSEA"))
# KEGG pathways from the KEGGREST R package
list.files(system.file("Pathways/KEGG_new", package = "GSEA"))
# Biocarat pathways
list.files(system.file("Pathways/Biocarta", package = "GSEA"))

The description of 229 KEGG pathways from the KEGG.db R package can be obtrained through the following code:

desc <- read.table(system.file("Pathways/Names_KEGG.txt", package = "GSEA"), sep="\t", header=F)
if(!requireNamespace("DT", quietly = TRUE) )
  install.packages("DT", repos = "https://cran.rstudio.com")
DT::datatable(head(desc, n = nrow(desc)), options = list(pageLength = 5)) 

The description of 333 KEGG pathways from the KEGGREST R package can be obtrained through the following code:

desc <- read.table(system.file("Pathways/Names_KEGG_new.txt", package = "GSEA"), sep="\t", header=F)
DT::datatable(head(desc, n = nrow(desc)), options = list(pageLength = 5)) 

The description of Biocarta pathways can be obtrained through the following code:

desc <- read.table(system.file("Pathways/Names_Biocarta.txt", package = "GSEA"), sep="\t", header=F)
DT::datatable(head(desc, n = nrow(desc)), options = list(pageLength = 5))

3.2 Transcription factor targets

This package includes sets of genes that have been experimentally verified as targets of the same transcription factor. We used transcription factor binding curation information in the Transcriptional Regulatory Element Database (TRED) (F. Zhao et al. 2005) to eliminate targets without any experimental verification.

# Gene sets of transcription factor targets for human
list.files(system.file("Transcription_factor/Tred_human", package = "GSEA"))

3.3 MicroRNA targets

Gene sets of experimentally verified microRNA targets are also included. These microRNA target gene lists were created based on miRTarBase v4.x database(Hsu et al. 2014) and v7.0 database(Chou et al. 2017), respectively. Genes in each set are targets of the same microRNA. All the microRNA-target interactions were verified by reporter assay or western blot experiments. Separate sets of human and mouse genes are provided.

# Gene sets of microRNA targets for human from miRTarBase v4
list.files(system.file("MicroRNA/v4/experimentally_verified_human", package = "GSEA"))
# Gene sets of microRNA targets for human from miRTarBase v7
list.files(system.file("MicroRNA/v7/experimentally_verified_human", package = "GSEA"))

4 Find enriched gene sets

gsea is the main R function to perform the pathway enrichment analysis. In this section, we will look into details about how to prepare inputs for the gsea function when finding enriched gene sets among pre-defined classes, for survival data, and for quantitative trait of samples.

A pathway is selected if its corresponding re-sampling p-value is below the specified threshold. The re-sampling p-value is calculated through permutation tests. Basically, 100,000 LS (log score) or KS (Kolmogorov-Smirnov) permutation tests are conducted to calculate a p-value measuring the gene set enrichment. For each pathway, N genes are randomly selected from a gene list in analysis, where N is the number of genes belonging to that pathway. For each permutation, the log score (LS) or Kolmogorov-Smirnov (KS) statistics associated with that gene set are computed based on the p-values (Xu, Zhao, and Simon 2007).

For a set of N genes, the LS statistic is defined as \[{\rm LS} = \sum_{i=1}^{N} \frac{-{\rm log}p_i}{N},\] which is the mean negative natural logarithm of the p-values of the appropriate single gene univariate test.

The KS statistic is defined as \[{\rm KS}= \max_{i=1}^N \bigg(\frac{i}{N}-p_{(i)}\bigg),\; p_{(1)} \le p_{(2)} \le ... \le p_{(N)},\] where \((p_{(1)} , p_{(2)}, ..., p_{(N)})\) is ordered increasingly of \(p_1, p_2, ..., p_N\). The KS statistic is the maximum difference between \(i/N\) and \(p_{(i)}\), where \(p_{(i)}\) is the ith smallest p-value of the univariate test. This is the Kolmogorov-Smirnov statistic for testing if the p-values are of a uniform distribution.

The p-value for that pathway is then defined as the proportion of permutations for which the LS (KS) statistics are larger than the observed LS (KS) statistics from original data.

In addition, Efron-Tibshirani’s Gene Set Analysis (Efron, Tibshirani, and others 2007), is implemented to use “maxmean” statistics for assessing signficance of pre-defined gene-sets. GSA uses the maxmean statistic which is the mean of the positive or negative part of gene scores in the gene set, whichever is large in absolute value. Efron and Tibshirani show that this is often more powerful than the modified Kolmogorov-Smirnov statistic used in the original GSEA procedure.

4.1 Data with pre-defined classess

The package can find BioCarta or KEGG pathways with statistically significant differences among pre-defined classes. We here use the “Brca” sample data. The package contains the following “Brca” data information:

  • Brca_LOGRAT.txt : a table of expression data with rows representing genes and columns representing samples;

  • Brca_FILTER.txt: a list of filtering information, where 1 means the corresponding gene passes the filters while 0 means it is excluded from analysis;

  • Brca_GENEID.txt: a table of gene information corresponding to row information of Brca_LOGRAT.txt and Brca_FILTER.txt;

  • Brca_EXPDESIGN.txt: a table with sample information such as classes.

There are a total of 15 samples. We run the following code to obtain objects like expr, filter, geneId and cls as inputs to gsea.

The following example finds significally enriched BioCarta pathways among two classes.

library(GSEA)
dataset<-"Brca"
# gene IDs
geneId <- read.delim(system.file("extdata", paste0(dataset, "_GENEID.txt"),
                                 package = "GSEA"), as.is = TRUE, colClasses = "character")
# gene expression
x <- read.delim(system.file("extdata", paste0(dataset, "_LOGRAT.TXT"),
                            package = "GSEA"), header = FALSE)
filter <- scan(system.file("extdata", paste0(dataset, "_FILTER.TXT"),
                           package = "GSEA"), quiet = TRUE)
# sample information
expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"),
                                    package = "GSEA"), as.is = TRUE)
ind1 <- which(expdesign[, 4] == "BRCA1")
ind2 <- which(expdesign[, 4] == "BRCA2")
ind <- c(ind1, ind2)
expr <- x[, ind]
colnames(expr) <- expdesign[ind, 1]
projectPath <- file.path(Sys.getenv("HOME"),"Brca")
outputName <- "GeneSetClassComparison"
cls <- c(rep("BRCA1", length(ind1)), rep("BRCA2", length(ind2)))
res.class <- gsea(expr = expr,
            filter = filter,
            geneId = geneId,
            cls = cls,
            geneSetType = "BioCarta",
            isSingleChannel = FALSE,
            alpha = 0.005,
            organism = "human",
            projectPath = projectPath,
            outputName = outputName,
            popHTML = FALSE)
## [1] "Creating ExpressionSet object ..."
## [1] "Saving ExpressionSet object to disk"
## [1] "Loading data and gene sets ..."
## [1] "Conducting GSA test for 199 gene sets ..."
## perm= 10 / 200 
## perm= 20 / 200 
## perm= 30 / 200 
## perm= 40 / 200 
## perm= 50 / 200 
## perm= 60 / 200 
## perm= 70 / 200 
## perm= 80 / 200 
## perm= 90 / 200 
## perm= 100 / 200 
## perm= 110 / 200 
## perm= 120 / 200 
## perm= 130 / 200 
## perm= 140 / 200 
## perm= 150 / 200 
## perm= 160 / 200 
## perm= 170 / 200 
## perm= 180 / 200 
## perm= 190 / 200 
## perm= 200 / 200 
## [1] "Loading data and gene sets ..."
## 
## Preparing output tables ...
## 
## An html document with significantly enriched pathways was saved inC:/Users/tchen/Documents/Brca/Output/GeneSetClassComparison/GeneSetClassComparison.html
DT::datatable(res.class, options = list(pageLength = 5))

For more details about gsea, please type ?gsea in the R console.

4.2 Survival data

It can find gene sets that are significantly correlated with survival. The “Pomeroy” dataset with 60 samples of 1914 genes passing the filter are used. The following example finds BioCarta pathways that are correlated with survival with statistical significance.

dataset<-"Pomeroy"
# gene IDs
geneId <- read.delim(system.file("extdata", paste0(dataset, "_GENEID.txt"),
                                 package = "GSEA"), as.is = TRUE, colClasses = "character")
# expression data
x <- read.delim(system.file("extdata", paste0(dataset, "_LOGINT.TXT"),
                            package = "GSEA"), header = FALSE)
# filter information, 1 - pass the filter, 0 - filtered
filter <- scan(system.file("extdata", paste0(dataset, "_FILTER.TXT"),
                           package = "GSEA"), quiet = TRUE)
# sample information
expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"),
                                   package = "GSEA"), as.is = TRUE)
time <- expdesign[,7]
status <- expdesign[,12]
ind1 <- which(status == 0)
ind2 <- which(status == 1)
ind <- c(ind1, ind2)
expr <- x[, ind]
time <- time[ind]
status <- status[ind]
colnames(expr) <- expdesign[ind, 1]
projectPath <- file.path(Sys.getenv("HOME"),dataset)
outputName <- "GeneSetSurvivalComparison"
res.surv <- gsea(expr = expr,
                 filter = filter,
                 surv = T,
                 time = time,
                 status = status,
                 geneId = geneId,
                 rvm = FALSE,
                 geneSetType = "BioCarta",
                 alpha = 0.005,
                 organism = "human",
                 projectPath = projectPath,
                 outputName = outputName,
                 popHTML = FALSE)
## [1] "Creating ExpressionSet object ..."
## [1] "Saving ExpressionSet object to disk"
## [1] "Loading data and gene sets ..."
## [1] "Conducting GSA test for 147 gene sets ..."
## perm= 10 / 200 
## perm= 20 / 200 
## perm= 30 / 200 
## perm= 40 / 200 
## perm= 50 / 200 
## perm= 60 / 200 
## perm= 70 / 200 
## perm= 80 / 200 
## perm= 90 / 200 
## perm= 100 / 200 
## perm= 110 / 200 
## perm= 120 / 200 
## perm= 130 / 200 
## perm= 140 / 200 
## perm= 150 / 200 
## perm= 160 / 200 
## perm= 170 / 200 
## perm= 180 / 200 
## perm= 190 / 200 
## perm= 200 / 200 
## 
## Preparing output tables ...
## 
## An html document with significantly enriched pathways was saved inC:/Users/tchen/Documents/Pomeroy/Output/GeneSetSurvivalComparison/GeneSetSurvivalComparison.html
DT::datatable(res.surv, options = list(pageLength = 5))

4.3 Quantitative trait of samples

In this case, it finds genes that are significantly correlated with a specified quantitative trait such as age. Spearman or Pearson correlation coefficients are used as a measure of correlation and to compute parametric p-values. The following example finds significantly enriched KEGG pathways from the KEGGREST R package.

dataset<-"Brca"
# gene IDs
geneId <- read.delim(system.file("extdata", paste0(dataset, "_GENEID.txt"),
                                 package = "GSEA"), as.is = TRUE, colClasses = "character")
# expression data
expr <- read.delim(system.file("extdata", paste0(dataset, "_LOGRAT.TXT"),
                               package = "GSEA"), header = FALSE)
# filter information, 1 - pass the filter, 0 - filtered
filter <- scan(system.file("extdata", paste0(dataset, "_FILTER.TXT"),
                           package = "GSEA"), quiet = TRUE)
# sample information
expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"),
                                    package = "GSEA"), as.is = TRUE)
quant <- expdesign[,11]
colnames(expr) <- expdesign[, 1]
projectPath <- file.path(Sys.getenv("HOME"),"Brca")
outputName <- "GeneSetQTComparison"
res.qt <- gsea(expr = expr,
               filter = filter,
               quant = quant,
               geneId = geneId,
               geneSetType = "KEGG",
               fromKEGGdb = FALSE,
               isSingleChannel = FALSE,
               alpha = 0.005,
               organism = "human",
               corrtest.method = "pearson",
               projectPath = projectPath,
               outputName = outputName,
               popHTML = FALSE)
## [1] "Creating ExpressionSet object ..."
## [1] "Saving ExpressionSet object to disk"
## [1] "Loading data and gene sets ..."
## [1] "Conducting GSA test for 296 gene sets ..."
## perm= 10 / 200 
## perm= 20 / 200 
## perm= 30 / 200 
## perm= 40 / 200 
## perm= 50 / 200 
## perm= 60 / 200 
## perm= 70 / 200 
## perm= 80 / 200 
## perm= 90 / 200 
## perm= 100 / 200 
## perm= 110 / 200 
## perm= 120 / 200 
## perm= 130 / 200 
## perm= 140 / 200 
## perm= 150 / 200 
## perm= 160 / 200 
## perm= 170 / 200 
## perm= 180 / 200 
## perm= 190 / 200 
## perm= 200 / 200 
## [1] "Loading data and gene sets ..."
## 
## Preparing output tables ...
## 
## An html document with significantly enriched pathways was saved inC:/Users/tchen/Documents/Brca/Output/GeneSetQTComparison/GeneSetQTComparison.html
DT::datatable(res.qt, options = list(pageLength = 5))

5 Session info

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] GSEA_0.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2          knitr_1.25          magrittr_1.5       
##  [4] BiocGenerics_0.30.0 xtable_1.8-4        R6_2.4.0           
##  [7] GSA_1.03.1          stringr_1.4.0       tools_3.6.0        
## [10] parallel_3.6.0      DT_0.9              Biobase_2.44.0     
## [13] xfun_0.7            htmltools_0.3.6     crosstalk_1.0.0    
## [16] yaml_2.2.0          digest_0.6.21       shiny_1.3.2        
## [19] later_0.8.0         bitops_1.0-6        htmlwidgets_1.3    
## [22] promises_1.0.1      codetools_0.2-16    evaluate_0.14      
## [25] mime_0.6            rmarkdown_1.15      stringi_1.4.3      
## [28] compiler_3.6.0      jsonlite_1.6        httpuv_1.5.2       
## [31] Cairo_1.5-10

References

Chou, Chih-Hung, Sirjana Shrestha, Chi-Dung Yang, Nai-Wen Chang, Yu-Ling Lin, Kuang-Wen Liao, Wei-Chi Huang, et al. 2017. “MiRTarBase Update 2018: A Resource for Experimentally Validated microRNA-Target Interactions.” Nucleic Acids Research 46 (D1). Oxford University Press: D296–D302.

Efron, Bradley, Robert Tibshirani, and others. 2007. “On Testing the Significance of Sets of Genes.” The Annals of Applied Statistics 1 (1). Institute of Mathematical Statistics: 107–29.

Hsu, Sheng-Da, Yu-Ting Tseng, Sirjana Shrestha, Yu-Ling Lin, Anas Khaleel, Chih-Hung Chou, Chao-Fang Chu, et al. 2014. “MiRTarBase Update 2014: An Information Resource for Experimentally Validated miRNA-Target Interactions.” Nucleic Acids Research 42 (D1). Oxford University Press: D78–D85.

Xu, Xiaojiang, Yingdong Zhao, and Richard Simon. 2007. “Gene Set Expression Comparison Kit for Brb-Arraytools.” Bioinformatics 24 (1). Oxford University Press: 137–39.

Zhao, Fang, Zhenyu Xuan, Lihua Liu, and Michael Q Zhang. 2005. “TRED: A Transcriptional Regulatory Element Database and a Platform for in Silico Gene Regulation Studies.” Nucleic Acids Research 33 (suppl_1). Oxford University Press: D103–D107.