The BioConnector Wiki is a standards-compliant, simple-to-use wiki, mainly aimed at creating documentation. Its simple but powerful [[https://www.dokuwiki.org/syntax|syntax]] makes it easy to share ideas, collaborate on projects, and document workflows. The wiki keeps a record of every [[https://www.dokuwiki.org/attic|revision]] and a list of [[https://www.dokuwiki.org/recent_changes|recent changes]], enabling transparency, reproducibility, and provenance throughout the data analysis and research life cycle. The wiki also supports [[https://www.dokuwiki.org/syntax#syntax_highlighting|syntax highlighting]] of nearly any programming language (including Perl, Python, and R), uploading and embedding of [[https://www.dokuwiki.org/images|images]] and other media, and organization of pages and permissions via [[https://www.dokuwiki.org/namespaces|namespaces]].

Interested in documenting your group's work using the wiki? [[http://www.google.com/recaptcha/mailhide/d?k=016RBE9TN7Hvz1Ju8fMFOpzA==&c=tKJn9IgI_y0sZjVgHqezgmc0aljyrZADEE2snwoA7FE=|Email Pankaj Kumar]], director of the [[http://bioinformatics.virginia.edu|Bioinformatics Core]], for an account.

====== Example Entry ======

The example entry below was originally part of a [[http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html|blog post on pathway analysis at Getting Genetics Done]] by [[http://www.google.com/recaptcha/mailhide/d?k=016RBE9TN7Hvz1Ju8fMFOpzA==&c=tKJn9IgI_y0sZjVgHqezgmc0aljyrZADEE2snwoA7FE=|Stephen Turner]].

===== Pathway Analysis: Gene expression and colon cancer susceptibility =====

Data and background from: Hong Y, Ho KS, Eu KW, Cheah PY. A susceptibility gene set for early onset colorectal cancer that integrates diverse signaling pathways: implication for tumorigenesis. //Clin Cancer Res// 2007 Feb 15;13(4):1107-14. PMID: [[http://www.ncbi.nlm.nih.gov/pubmed/17317818|17317818]]. Download the original data on GEO using accession number: [[http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4107|GSE4107]].

==== Background ====

Causative genes for autosomal dominantly inherited familial adenomatous polyposis (FAP) and hereditary non-polyposis colorectal cancer (HNPCC) have been well characterized. There is, however, another 10-15% of early-onset colorectal cancer (CRC) whose genetic components are currently unknown. In this study, we used DNA chip technology to systematically search for genes differentially expressed in early-onset CRC.

==== Methods ====

=== Overall design ===

RNA extracted from the colonic mucosa of healthy controls (10 samples) and patients (12 samples) was analyzed using the GeneChip U133-Plus 2.0 Array. Patients and controls were age- (50 or less), ethnicity- (Chinese) and tissue-matched. T-tests, hierarchical clustering, mean fold-change and principal component analysis were used to identify genes that differentiate patients from controls. These were subsequently verified by real-time polymerase chain reaction (PCR). Signaling Pathway Impact Analysis (SPIA) was used to perform a systems biology pathway analysis.
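
As a rough illustration of the per-gene t-test and mean fold-change mentioned above, here is a minimal base-R sketch. It runs on simulated values, not the study data; the gene names, the matrix ''expr'', and the shift added to ''gene1'' are all hypothetical, and the values are assumed to be on a log2 scale.

<code rsplus>
# Simulated expression values only (NOT the GSE4107 data).
set.seed(1)
n_genes <- 5
group <- factor(c(rep("patient", 12), rep("control", 10)))

# Rows are genes, columns are samples; values are assumed to be log2 intensities.
expr <- matrix(rnorm(n_genes * 22, mean = 8), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes), NULL))

# Hypothetical effect: shift gene1 upward in the patient samples.
expr["gene1", group == "patient"] <- expr["gene1", group == "patient"] + 3

is_patient <- group == "patient"
res <- data.frame(
  # mean log2 fold-change, patients minus controls
  logFC = rowMeans(expr[, is_patient]) - rowMeans(expr[, !is_patient]),
  # Welch two-sample t-test p-value for each gene
  p.value = apply(expr, 1, function(x) t.test(x[is_patient], x[!is_patient])$p.value)
)
res
</code>

The limma workflow in the R code section replaces these ordinary t-tests with moderated ones, which is generally preferable when there are few samples per group.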

=== R code ===

<code rsplus examplecode.R>
# These are Bioconductor packages. See http://www.bioconductor.org/install/ for installation instructions
library(Biobase)
library(GEOquery)
library(limma)
library(SPIA)
library(hgu133plus2.db)

# load series and platform data from GEO:
# http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE4107
eset <- getGEO("GSE4107", GSEMatrix=TRUE)[[1]]

# log2 transform the expression values
exprs(eset) <- log2(exprs(eset))

# set up a design matrix and contrast matrix.
# This assumes the first 12 samples are patients and the last 10 are controls;
# check pData(eset) to confirm. The factor levels are set explicitly so that
# the design columns come out in the order cancer, normal.
group <- factor(c(rep("cancer",12), rep("normal",10)), levels=c("cancer","normal"))
design <- model.matrix(~0+group)
colnames(design) <- levels(group)
contrast.matrix <- makeContrasts(cancer_v_normal=cancer-normal, levels=design)

# run the analysis with empirical Bayes moderated standard errors
fit <- lmFit(eset,design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

# get results for all probes (number=nrow(fit2)), with FDR-adjusted p-values
top <- topTable(fit2, coef="cancer_v_normal", number=nrow(fit2), adjust.method="fdr")
top <- na.omit(subset(top, select=c(ID, logFC, adj.P.Val)))
top$ID <- as.character(top$ID)

# annotate with Entrez gene IDs and gene symbols,
# dropping unannotated probes and keeping one probe per gene
top$ENTREZ<-unlist(as.list(hgu133plus2ENTREZID[top$ID]))
top<-top[!is.na(top$ENTREZ),]
top<-top[!duplicated(top$ENTREZ),]
top$SYMBOL<-unlist(as.list(hgu133plus2SYMBOL[top$ID]))
top<-top[!is.na(top$SYMBOL),]
top<-top[!duplicated(top$SYMBOL),]

# significant genes is a vector of fold changes where the names
# are ENTREZ gene IDs. The background set is a vector of all the
# genes represented on the platform.
sig_genes <- subset(top, adj.P.Val<0.01)$logFC
names(sig_genes) <- subset(top, adj.P.Val<0.01)$ENTREZ
all_genes <- top$ENTREZ

# run SPIA.
spia_result <- spia(de=sig_genes, all=all_genes, organism="hsa", plots=TRUE)

# Once you start running SPIA you'll see it go through all the KEGG pathways
# for your organism. This will take a few minutes! Be patient.
# Done pathway 1 : RNA transport..
# Done pathway 2 : RNA degradation..
# Done pathway 3 : PPAR signaling pathway..
# Done pathway 4 : Fanconi anemia pathway..
# Done pathway 5 : MAPK signaling pathway..
# Done pathway 6 : ErbB signaling pathway..
# Done pathway 7 : Calcium signaling pathway..
# Done pathway 8 : Cytokine-cytokine receptor int..
# Done pathway 9 : Chemokine signaling pathway..
# Done pathway 10 : Neuroactive ligand-receptor in..

# inspect the top pathways and draw the two-way evidence plot
head(spia_result)
plotP(spia_result, threshold=0.05)
</code>
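
The columns of a design matrix built from a factor follow the factor's level order, so it is worth checking that order before naming the columns. The base-R sketch below illustrates this. It assumes, as the group sizes suggest, that the 12 patient samples come first and the 10 controls last; the variable names are illustrative only.

<code rsplus>
# Assumed sample order: 12 patients (coded 1) followed by 10 controls (coded 0).
status <- c(rep(1, 12), rep(0, 10))

# With a numeric 0/1 coding the factor levels sort as "0", "1",
# so the FIRST design column is the indicator for the 0 group.
design_default <- model.matrix(~ 0 + as.factor(status))
first_col_n <- sum(design_default[, 1])  # number of samples coded 0

# Setting the levels explicitly makes the column order unambiguous.
group <- factor(ifelse(status == 1, "cancer", "normal"),
                levels = c("cancer", "normal"))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Samples per column; should match the known group sizes.
colSums(design)
</code>

Comparing ''colSums(design)'' with the known group sizes, and the sample annotation in ''pData(eset)'' with the assumed order, is a cheap check that the contrast points in the intended direction.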

==== Results ====

The output from Signaling Pathway Impact Analysis is a list of pathways, whether they're activated or inhibited, and three different p-values. p(NDE) is the p-value on the Number of Differentially Expressed genes. This is nearly identical to a GO over-representation analysis: it's the significance of the over-representation of differentially expressed genes in the given pathway. p(PERT) is the significance of the overall perturbation of the pathway, which takes the pathway's topology into account. [[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732297/figure/F1/|Figure 1 in the SPIA paper]] explains this well. p(G) is the combined p-value from both p(NDE) and p(PERT). p(G)<sub>FDR</sub> is the FDR-corrected overall p-value. The SPIA plots show p(NDE) and p(PERT) on the two axes, so the most significant pathways appear in the upper right corner. [[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732297/figure/F4/|Figure 4 in the SPIA paper]] shows an example.
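
To make these p-values concrete, here is a minimal base-R sketch. All counts and the p(PERT) value are made-up illustrative numbers, not results from this analysis. The combination rule shown (with k = p(NDE) x p(PERT), p(G) = k - k ln k) is our reading of the Fisher-style combination described in the SPIA paper; treat it as an assumption and see the paper for the authoritative definition. The helper ''combine_p'' is hypothetical and not part of the SPIA package.

<code rsplus>
# Illustrative counts only (not taken from the analysis above).
n_all   <- 20000  # genes in the background set
n_de    <- 500    # differentially expressed genes
n_path  <- 100    # background genes annotated to the pathway
n_de_in <- 10     # differentially expressed genes in the pathway

# p(NDE): probability of seeing n_de_in or more DE genes in the pathway
# by chance, from the hypergeometric distribution.
p_nde <- phyper(n_de_in - 1, n_path, n_all - n_path, n_de, lower.tail = FALSE)

# p(G): combine two independent p-values through their product k.
# Hypothetical helper, assuming the rule p(G) = k - k * ln(k).
combine_p <- function(p1, p2) {
  k <- p1 * p2
  k - k * log(k)
}
p_pert <- 0.05  # hypothetical perturbation p-value
p_g <- combine_p(p_nde, p_pert)

# FDR correction across pathways (Benjamini-Hochberg);
# the other two p-values are placeholders for other pathways.
p_g_fdr <- p.adjust(c(p_g, 0.2, 0.5), method = "fdr")
p_g_fdr
</code>

The combined value is never smaller than the product of the two inputs, and it equals the upper tail of a chi-squared distribution with 4 degrees of freedom evaluated at -2 ln k, which is the usual form of Fisher's method for two p-values.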

{{:spia-plot.png?nolink&|}}

Limitations: The method employed above has several limitations worth noting. We don't fully understand biology, and our understanding of molecular networks and signaling pathways is still very low-resolution. We also don't have information about how different isoforms have different effects, which is something we'll get from RNA-seq experiments. Annotations are often inaccurate, and we don't have much cell-type-specific or dynamic information about these pathways. Finally, the analysis of large gene lists with current enrichment tools is still more of an exploratory data-mining procedure than a pure statistical solution and analytical endpoint.