# Bioinformatics Core Wiki

# BioConnector Wiki

The BioConnector Wiki is a standards-compliant, easy-to-use wiki aimed mainly at creating documentation. Its simple but powerful syntax makes it easy to share ideas, collaborate on projects, and document workflows. The wiki keeps a record of every revision and a list of recent changes, supporting transparency, reproducibility, and provenance throughout the data analysis and research life cycle. The wiki also supports syntax highlighting of nearly any programming language (including Perl, Python, and R), uploading and embedding of images and other media, and organization of pages and permissions via namespaces.

Interested in documenting your group's work using the wiki? Email Stephen Turner, director of the Bioinformatics Core, for an account.

# Example Entry

The example entry below was originally part of a tutorial blog post on pathway analysis, published on Getting Genetics Done by Stephen Turner.

## Pathway Analysis: Gene expression and colon cancer susceptibility

Data and background from: Hong Y, Ho KS, Eu KW, Cheah PY. A susceptibility gene set for early onset colorectal cancer that integrates diverse signaling pathways: implication for tumorigenesis. Clin Cancer Res 2007 Feb 15;13(4):1107-14. PMID: 17317818. Download the original data on GEO using accession number: GSE4107.

### Background

Causative genes for autosomal dominantly inherited familial adenomatous polyposis (FAP) and hereditary non-polyposis colorectal cancer (HNPCC) have been well characterized. There is, however, another 10-15% of early onset colorectal cancer (CRC) cases whose genetic components are currently unknown. In this study, we used DNA chip technology to systematically search for genes differentially expressed in early onset CRC.

### Methods

#### Overall design

RNA extracted from colonic mucosa of healthy controls (10 samples) and patients (12 samples) was analyzed using the GeneChip U133-Plus 2.0 Array. Patients and controls were age- (50 or less), ethnicity- (Chinese) and tissue-matched. T-test, hierarchical clustering, mean fold-change and principal component analysis were used to identify genes that differentiate between patients and controls. These were subsequently verified by real-time polymerase chain reaction (PCR) technology. Signaling Pathway Impact Analysis (SPIA) was used to perform a systems biology pathway analysis.
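As a rough illustration of the exploratory steps named above (mean fold-change, per-gene t-test, hierarchical clustering, and principal component analysis), here is a minimal base R sketch. It runs on simulated data, not the GSE4107 measurements: the matrix `expr`, the gene names, and the size of the simulated shift are all assumptions made for illustration, and only the group sizes (12 patients, 10 controls) come from the study design.

```r
# Simulated log2 expression matrix: 100 hypothetical genes x 22 samples
# (12 patients followed by 10 controls, mirroring the study's group sizes).
set.seed(1)
group <- factor(c(rep("cancer", 12), rep("normal", 10)),
                levels = c("cancer", "normal"))
expr <- matrix(rnorm(100 * 22, mean = 8, sd = 1), nrow = 100,
               dimnames = list(paste0("gene", 1:100), NULL))

# Shift the first five genes upward in patients so there is a signal to find.
expr[1:5, group == "cancer"] <- expr[1:5, group == "cancer"] + 3

# Mean log2 fold change (cancer minus normal) for every gene.
log_fc <- rowMeans(expr[, group == "cancer"]) -
  rowMeans(expr[, group == "normal"])

# Per-gene Welch t-test p-values.
p_val <- apply(expr, 1, function(x) {
  t.test(x[group == "cancer"], x[group == "normal"])$p.value
})

# Hierarchical clustering of samples on Euclidean distance.
hc <- hclust(dist(t(expr)))

# Principal component analysis with samples as rows.
pca <- prcomp(t(expr), center = TRUE)

# Genes ranked by p-value; the shifted genes should appear near the top.
head(sort(p_val), 5)
```

The limma workflow in the R code section replaces the plain t-test with moderated statistics, which is usually preferable when there are only a handful of samples per group.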

#### R code

examplecode.R
# These are bioconductor packages. See http://www.bioconductor.org/install/ for installation instructions
library(Biobase)
library(GEOquery)
library(limma)
library(SPIA)
library(hgu133plus2.db)

# load series and platform data from GEO:
# http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE4107
eset <- getGEO("GSE4107", GSEMatrix=TRUE)[[1]]

# log transform
exprs(eset) <- log2(exprs(eset))

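# set up a design matrix and contrast matrix
# (assumes the first 12 samples are patients and the last 10 are controls)
# explicit factor levels keep the design columns in the same order as the labels
group <- factor(c(rep("cancer",12), rep("normal",10)), levels=c("cancer","normal"))
design <- model.matrix(~0+group)
colnames(design) <- levels(group)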
contrast.matrix <- makeContrasts(cancer_v_normal=cancer-normal, levels=design)

# run the analysis with empirical Bayes moderated standard errors
fit <- lmFit(eset,design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

# get useful information for all genes, ranked by evidence of differential expression
top <- topTable(fit2, coef="cancer_v_normal", number=nrow(fit2), adjust.method="fdr")
top <- na.omit(subset(top, select=c(ID, logFC, adj.P.Val)))
top$ID <- as.character(top$ID)

# annotate with entrez info
top$ENTREZ<-unlist(as.list(hgu133plus2ENTREZID[top$ID]))
top<-top[!is.na(top$ENTREZ),]
top<-top[!duplicated(top$ENTREZ),]
top$SYMBOL<-unlist(as.list(hgu133plus2SYMBOL[top$ID]))
top<-top[!is.na(top$SYMBOL),]
top<-top[!duplicated(top$SYMBOL),]

# significant genes is a vector of fold changes where the names
# are ENTREZ gene IDs. The background set is a vector of all the
# genes represented on the platform.
sig_genes <- subset(top, adj.P.Val<0.01)$logFC
names(sig_genes) <- subset(top, adj.P.Val<0.01)$ENTREZ
all_genes <- top$ENTREZ

# run SPIA.
spia_result <- spia(de=sig_genes, all=all_genes, organism="hsa", plots=TRUE)

# Once you start running SPIA you'll see it go through all the KEGG pathways
# for your organism. This will take a few minutes! Be patient.
# Done pathway 1 : RNA transport..
# Done pathway 2 : RNA degradation..
# Done pathway 3 : PPAR signaling pathway..
# Done pathway 4 : Fanconi anemia pathway..
# Done pathway 5 : MAPK signaling pathway..
# Done pathway 6 : ErbB signaling pathway..
# Done pathway 7 : Calcium signaling pathway..
# Done pathway 8 : Cytokine-cytokine receptor int..
# Done pathway 9 : Chemokine signaling pathway..
# Done pathway 10 : Neuroactive ligand-receptor in..

plotP(spia_result, threshold=0.05)
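The shape of the two SPIA inputs described in the code comments above can be easier to see on a small example. The sketch below builds them from a toy table; the probe names, Entrez IDs, and values in `top_toy` are made up for illustration and do not come from GSE4107.

```r
# Toy stand-in for the annotated topTable output (all values hypothetical).
top_toy <- data.frame(
  ID        = c("probe_a", "probe_b", "probe_c", "probe_d"),
  logFC     = c(2.1, -1.4, 0.2, 1.8),
  adj.P.Val = c(0.001, 0.004, 0.60, 0.20),
  ENTREZ    = c("1001", "1002", "1003", "1004"),
  stringsAsFactors = FALSE
)

# Differentially expressed genes: log fold changes named by Entrez ID.
sig <- subset(top_toy, adj.P.Val < 0.01)
sig_genes_toy <- setNames(sig$logFC, sig$ENTREZ)

# Background set: every Entrez ID that survived annotation and filtering.
all_genes_toy <- top_toy$ENTREZ

# Sanity checks before a long SPIA run: every DE gene must be in the
# background, and IDs must be unique.
stopifnot(all(names(sig_genes_toy) %in% all_genes_toy))
stopifnot(!any(duplicated(all_genes_toy)))

sig_genes_toy
```

Running the same two `stopifnot` checks on the real `sig_genes` and `all_genes` objects is a cheap way to catch annotation problems before waiting several minutes for SPIA to finish.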

### Results

The output from Signaling Pathway Impact Analysis is a list of pathways, whether each is activated or inhibited, and three different p-values. p(NDE) is the p-value on the Number of Differentially Expressed genes. This is nearly identical to a GO over-representation analysis: it is the significance of the over-representation of differentially expressed genes in the given pathway. p(PERT) is the significance of the overall perturbation of the pathway, which takes topology into account. Figure 1 in the SPIA paper explains this well. p(G) is the combined p-value from both p(NDE) and p(PERT). p(G)FDR is the FDR-corrected overall p-value. The SPIA plots show p(NDE) on one axis and p(PERT) on the other (on a negative log scale), so the most significant pathways fall in the upper right corner. Figure 4 in the SPIA paper explains this well.
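To make the relationship between these p-values concrete, here is a small base R sketch with made-up numbers. It assumes p(NDE) is a hypergeometric over-representation test and that p(G) is formed with the Fisher-style product rule described in the SPIA paper, p(G) = c - c·ln(c) with c = p(NDE)·p(PERT). The pathway counts and the `p_pert` values are hypothetical; in a real analysis p(PERT) comes from SPIA's bootstrap procedure and cannot be computed this simply.

```r
# Hypothetical counts for three pathways (made-up numbers).
n_background <- 10000            # genes in the background set
n_de         <- 500              # differentially expressed genes overall
pathway_size <- c(100, 80, 120)  # background genes in each pathway
de_in_path   <- c(15, 4, 9)      # DE genes in each pathway

# p(NDE): probability of seeing at least this many DE genes in a pathway
# by chance (upper tail of the hypergeometric distribution).
p_nde <- phyper(de_in_path - 1, pathway_size,
                n_background - pathway_size, n_de,
                lower.tail = FALSE)

# p(PERT): hypothetical values standing in for SPIA's bootstrap result.
p_pert <- c(0.02, 0.50, 0.10)

# p(G): combine the two lines of evidence.
cc  <- p_nde * p_pert
p_g <- cc - cc * log(cc)

# p(G)FDR: Benjamini-Hochberg adjustment across pathways.
p_g_fdr <- p.adjust(p_g, method = "fdr")

data.frame(pathway_size, de_in_path, p_nde, p_pert, p_g, p_g_fdr)
```

A pathway can reach a small p(G) through over-representation alone, through perturbation alone, or through moderate evidence from both, which is why the two-way plot is worth looking at alongside the ranked table.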

Limitations: It's worth noting that the method employed above has limitations. We don't fully understand biology, and our understanding of molecular networks and signaling pathways is still very low-resolution. We also don't have information about how different isoforms have different effects, which is something we'll get from RNA-seq experiments. Annotations are often inaccurate, and we don't have very much cell-type specific or dynamic information about these pathways. Finally, the analysis of large gene lists with the current enrichment tools is still more an exploratory data-mining procedure than a pure statistical solution and analytical endpoint.