Differences

This shows you the differences between two versions of the page.

--- start [2012/09/03 13:16] – external edit 127.0.0.1
+++ start [2020/10/12 14:59] – pk7z
@@ Line 3: / Line 3: @@
 The BioConnector Wiki is a standards-compliant, simple-to-use wiki, mainly aimed at creating documentation. Its simple but powerful [[https://www.dokuwiki.org/syntax|syntax]] makes it easy to share ideas, collaborate on projects, and document workflows. The wiki keeps a record of every [[https://www.dokuwiki.org/attic|revision]] and a list of [[https://www.dokuwiki.org/recent_changes|recent changes]], enabling transparency, reproducibility, and provenance throughout the data analysis and research life cycle. The wiki also supports [[https://www.dokuwiki.org/syntax#syntax_highlighting|syntax highlighting]] of nearly any programming language (including Perl, Python, and R), uploading and embedding of [[https://www.dokuwiki.org/images|images]] and other media, and organization of pages and permissions via [[https://www.dokuwiki.org/namespaces|namespaces]].
-Interested in documenting your group's work using the wiki? [[http://www.google.com/recaptcha/mailhide/d?k=016RBE9TN7Hvz1Ju8fMFOpzA==&c=tKJn9IgI_y0sZjVgHqezgmc0aljyrZADEE2snwoA7FE=|Email Stephen Turner]], director of the [[http://bioinformatics.virginia.edu|Bioinformatics Core]], for an account.
+Interested in documenting your group's work using the wiki? [[http://www.google.com/recaptcha/mailhide/d?k=016RBE9TN7Hvz1Ju8fMFOpzA==&c=tKJn9IgI_y0sZjVgHqezgmc0aljyrZADEE2snwoA7FE=|Email Pankaj Kumar]], director of the [[http://bioinformatics.virginia.edu|Bioinformatics Core]], for an account.
-====== Example Entry ======
-The example entry below was originally part of a [[http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html|blog post on pathway analysis at Getting Genetics Done]] by [[http://www.google.com/recaptcha/mailhide/d?k=016RBE9TN7Hvz1Ju8fMFOpzA==&c=tKJn9IgI_y0sZjVgHqezgmc0aljyrZADEE2snwoA7FE=|Stephen Turner]].
-===== Pathway Analysis: Gene expression and colon cancer susceptibility =====
-Data and background from: Hong Y, Ho KS, Eu KW, Cheah PY. A susceptibility gene set for early onset colorectal cancer that integrates diverse signaling pathways: implication for tumorigenesis. //Clin Cancer Res// 2007 Feb 15;13(4):1107-14. PMID: [[http://www.ncbi.nlm.nih.gov/pubmed/17317818|17317818]]. Download the original data on GEO using accession number: [[http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4107|GSE4107]].
-==== Background ====
-Causative genes for autosomal dominantly inherited familial adenomatous polyposis (FAP) and hereditary non-polyposis colorectal cancer (HNPCC) have been well characterized. There is, however, another 10-15 % early onset colorectal cancer (CRC) whose genetic components are currently unknown. In this study, we used DNA chip technology to systematically search for genes differentially expressed in early onset CRC.
-==== Methods ====
-=== Overall design ===
-RNA extracted from colonic mucosa of healthy controls (10 samples) and patients (12 samples) was analyzed using the GeneChip U133-Plus 2.0 Array. Patients and controls were age- (50 or less), ethnicity- (Chinese) and tissue-matched. T-test, hierarchical clustering, mean fold-change and principal component analysis were used to identify genes that differentiate between patients and controls. These were subsequently verified by real-time polymerase chain reaction (PCR) technology. Signaling Pathway Impact Analysis was used to perform a systems biology pathway analysis.
-=== R code ===
-<code rsplus examplecode.R>
-# These are bioconductor packages. See http://www.bioconductor.org/install/ for installation instructions
-library(Biobase)
-library(GEOquery)
-library(limma)
-library(SPIA)
-library(hgu133plus2.db)
-# load series and platform data from GEO:
-# http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE4107
-eset <- getGEO("GSE4107", GSEMatrix =TRUE)[[1]]
-# log transform
-exprs(eset) <- log2(exprs(eset))
-# set up a design matrix and contrast matrix
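-# group labels assume the first 12 samples are patients and the last 10 are controls;
-# explicit factor levels keep the design columns in the order cancer, normal
-group <- factor(c(rep("cancer",12), rep("normal",10)), levels=c("cancer","normal"))
-design <- model.matrix(~0+group)
-colnames(design) <- levels(group)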
-contrast.matrix <- makeContrasts(cancer_v_normal=cancer-normal, levels=design)
-# run the analysis with empirical Bayes moderated standard errors
-fit <- lmFit(eset,design)
-fit2 <- contrasts.fit(fit, contrast.matrix)
-fit2 <- eBayes(fit2)
-# get statistics for all genes (number=nrow(fit2)), ranked by significance
-top <- topTable(fit2, coef="cancer_v_normal", number=nrow(fit2), adjust.method="fdr")
-top <- na.omit(subset(top, select=c(ID, logFC, adj.P.Val)))
-top$ID <- as.character(top$ID)
-# annotate with entrez info
-top$ENTREZ<-unlist(as.list(hgu133plus2ENTREZID[top$ID]))
-top<-top[!is.na(top$ENTREZ),]
-top<-top[!duplicated(top$ENTREZ),]
-top$SYMBOL<-unlist(as.list(hgu133plus2SYMBOL[top$ID]))
-top<-top[!is.na(top$SYMBOL),]
-top<-top[!duplicated(top$SYMBOL),]
-# significant genes is a vector of fold changes where the names
-# are ENTREZ gene IDs. The background set is a vector of all the
-# genes represented on the platform.
-sig_genes <- subset(top, adj.P.Val<0.01)$logFC
-names(sig_genes) <- subset(top, adj.P.Val<0.01)$ENTREZ
-all_genes <- top$ENTREZ
-# run SPIA.
-spia_result <- spia(de=sig_genes, all=all_genes, organism="hsa", plots=TRUE)
-# Once you start running SPIA you'll see it go through all the KEGG pathways
-# for your organism. This will take a few minutes! Be patient.
-# Done pathway 1 : RNA transport..
-# Done pathway 2 : RNA degradation..
-# Done pathway 3 : PPAR signaling pathway..
-# Done pathway 4 : Fanconi anemia pathway..
-# Done pathway 5 : MAPK signaling pathway..
-# Done pathway 6 : ErbB signaling pathway..
-# Done pathway 7 : Calcium signaling pathway..
-# Done pathway 8 : Cytokine-cytokine receptor int..
-# Done pathway 9 : Chemokine signaling pathway..
-# Done pathway 10 : Neuroactive ligand-receptor in..
-head(spia_result)
-plotP(spia_result, threshold=0.05)
-</code>
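The design-matrix step in the code above can be sanity-checked before fitting. The following is a minimal sketch in base R, not part of the original tutorial: the `group` and `coded` objects are hypothetical names, and the sample order (12 patients followed by 10 controls) is an assumption carried over from the code above. It illustrates that `model.matrix` orders columns by sorted factor level, so column names assigned by hand should be compared against the column sums.

```r
# Hypothetical group labels: assumes the first 12 samples are patients
# and the last 10 are controls, as in the analysis above.
group <- factor(c(rep("cancer", 12), rep("normal", 10)),
                levels = c("cancer", "normal"))

# One indicator column per group, no intercept.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Each column should sum to the size of the group it is named after.
print(colSums(design))

# With numeric codes, columns follow the sorted factor levels ("0" before "1"),
# so the first column marks the samples coded 0, not those coded 1.
coded <- model.matrix(~ 0 + as.factor(c(rep(1, 12), rep(0, 10))))
print(colnames(coded))
print(colSums(coded))
```

If the column sums do not match the expected group sizes, the sign of every log fold change from the contrast will be reversed.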
-==== Results ====
-The output from Signaling Pathway Impact Analysis is a list of pathways, whether each is activated or inhibited, and three different p-values. p(NDE) is the p-value for the Number of Differentially Expressed genes. This is nearly identical to a GO over-representation analysis: it is the significance of the over-representation of differentially expressed genes in the given pathway. p(PERT) is the significance of the overall perturbation of the pathway, which takes into account topology. [[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732297/figure/F1/|Figure 1 in the SPIA paper]] explains this well. p(G) is the combined p-value from both p(NDE) and p(PERT). p(G)<sub>FDR</sub> is the FDR-corrected overall p-value. The SPIA plot shows p(NDE) on one axis and p(PERT) on the other, so the most significant pathways appear in the upper right corner. [[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732297/figure/F4/|Figure 4 in the SPIA paper]] explains this well.
-{{:spia-plot.png?nolink&|}}
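How the two p-values are combined into p(G) can be illustrated with a few lines of base R. This is a sketch, not the SPIA package's own code: it uses the combination formula given in the SPIA paper, P(G) = c - c*ln(c) with c = p(NDE)*p(PERT), the function name `combine_pvalues` is hypothetical, and the input p-values are made up for illustration rather than taken from the analysis above.

```r
# Combine two independent p-values using the formula from the SPIA paper:
# P(G) = c - c * log(c), where c = p(NDE) * p(PERT).
# Note: c = 0 is not handled here and would yield NaN.
combine_pvalues <- function(p_nde, p_pert) {
  c_prod <- p_nde * p_pert
  c_prod - c_prod * log(c_prod)
}

# Made-up p-values for three hypothetical pathways (not real results).
p_nde  <- c(0.001, 0.04, 0.50)
p_pert <- c(0.020, 0.30, 0.60)

# Combined p-value per pathway, then Benjamini-Hochberg FDR correction.
p_g <- combine_pvalues(p_nde, p_pert)
p_g_fdr <- p.adjust(p_g, method = "fdr")

print(data.frame(p_nde, p_pert, p_g, p_g_fdr))
```

The combined value is always larger than the plain product of the two p-values, which reflects that a small product is easier to obtain by chance when two tests are multiplied together.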
-Limitations: The method employed above has limitations. We don't fully understand biology, and our understanding of molecular networks and signaling pathways is still very low-resolution. We also don't have information about how different isoforms have different effects, which is something we'll get from RNA-seq experiments. Annotations are often inaccurate, and we don't have very much cell-type specific or dynamic information about these pathways. Finally, the analysis of large gene lists with the current enrichment tools is still more of an exploratory data-mining procedure than a pure statistical solution and analytical endpoint.