Help document

1. The data pipeline of TFBSbank

2. Data collection

ChIP data were collected from modENCODE, ENCODE and UCSC.

3. Data analysis

There are 5 functional modules for the annotation of each TF binding profile: 1) 'basic analysis' for characterizing the length distribution, the distance to transcription start sites (TSS) and the associated genomic features (promoter, exon, intron and intergenic regions) of ChIP peaks; 2) 'target analysis' for putative target gene assignment and Gene Ontology (GO) term/Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment; 3) 'co-binding analysis' for identifying potential TF-responsive cis-regulatory modules (CRMs) and putative cofactors/collaborators; 4) 'known motif analysis' for investigating enriched known motifs in TF ChIP peaks; 5) 'de novo motif analysis' for scanning enriched de novo motifs in TF ChIP peaks.

4. Data searching


There are two strategies for searching the data.

First, users can begin a query by selecting a species and then choosing the corresponding TF.

Alternatively, users can build a query by entering keywords related to TF names and choosing the corresponding species.

After the query is submitted, a new page containing the results is returned.

5. Data viewing

To make browsing easier, we provide an interface listing the 5 species together with their names and other summary information. Each entry is hyperlinked to a species-specific data browsing page that offers two ways to view the data. One option is to view all the records of a particular species. Alternatively, for fly and worm, a picture of the animal's life cycle is provided so that users can select a developmental stage by clicking the symbols within the picture; they are then taken to a result page listing the TFs filtered by that developmental stage (i.e., embryo, larva, pupa or adult). On the search result page, the key information about the retrieved TFs is summarized in a table that includes the species name, the TF name and the tissue of origin. At the end of each row, there is a link to the results demonstration page, where the functional annotation of the corresponding TF can be viewed.


6. Data demonstration

To demonstrate the analysis results, we created a PHP page named 'result.php', on which the results of the 5 functional modules are presented (Figure 1d). To save webpage space and present the results in an organized manner, we provide a flowchart of the 5 analyses at the top of the page (basic analysis, target analysis, co-binding analysis, known motif analysis and de novo motif analysis). Clicking the name of an analysis shows the corresponding results under the flowchart.

7. Data downloading

At the top of each result page, there is a download button. Users can download the compressed results by clicking the button.

8. Methods

Data collection. ChIP-chip and ChIP-seq data were downloaded from ENCODE (human and mouse), modENCODE (fly and worm) and UCSC (yeast), respectively.

Basic analysis. The ChIP experimental design was extracted from the original data sources. A brief description of each TF (species name, gene ID, gene name, GO terms) was retrieved from the Ensembl database using BioMart. The length distribution and the distance to TSS were calculated using a custom R script.
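The two peak statistics named above can be illustrated with a minimal Python sketch. It is not the custom R script used for TFBSbank; the function names, the 1-based inclusive coordinates and the strand-agnostic distance are assumptions made for illustration only.

```python
from bisect import bisect_left


def peak_length(start, end):
    """Length of a peak, assuming 1-based inclusive coordinates."""
    return end - start + 1


def distance_to_nearest_tss(peak_start, peak_end, tss_positions):
    """Signed distance from the peak center to the nearest TSS.

    tss_positions: non-empty, ascending list of TSS coordinates on the
    same chromosome as the peak (hypothetical input format).
    Strand is ignored here; a negative value only means that the
    nearest TSS has a larger coordinate than the peak center.
    """
    center = (peak_start + peak_end) // 2
    i = bisect_left(tss_positions, center)
    # Only the TSS just before and just after the center can be nearest.
    candidates = tss_positions[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda t: abs(t - center))
    return center - nearest
```

Collecting these two values over all peaks of a TF gives the length distribution and the distance-to-TSS distribution shown in the basic analysis.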

Putative target gene assignment. The ChIP peaks were compared to transcript annotation databases using a custom Perl script. 'TxDb.Dmelanogaster.UCSC.dm3.ensGene', 'TxDb.Celegans.UCSC.ce6.ensGene', 'TxDb.Mmusculus.UCSC.mm9.knownGene', 'TxDb.Hsapiens.UCSC.hg19.knownGene' and 'TxDb.Scerevisiae.UCSC.sacCer3.sgdGene' were used for the gene annotation of fly, worm, mouse, human and yeast, respectively. To determine the genomic features of ChIP peaks, the locations of peaks were compared with the locations of UTR, intron, exon, promoter and intergenic regions. In total, 5 methods were used to assign target genes: 'physical overlap', 'nearest gene' and 'neighbor overlap' (1kb, 10kb, 100kb). 'Physical overlap' reports all genes directly overlapping with ChIP peaks. 'Nearest gene' reports the genes closest to the center of a ChIP peak regardless of distance. 'Neighbor overlap' reports all genes overlapping with the region obtained by extending the peak center by a given distance (1kb, 10kb, 100kb). For instance, if a ChIP peak named 'ChIP-peak-001' is located at the coordinates 'chr2L: 300,000-301,000', then all the genes overlapping with this region are considered targets of 'ChIP-peak-001' by the 'physical overlap' method. The gene closest to the peak center 'chr2L: 300,500' is considered the target of 'ChIP-peak-001' by the 'nearest gene' method. For the 'neighbor overlap' method, we extend the peak center by 1kb, 10kb and 100kb in each direction, and all the genes overlapping with the extended regions are considered putative targets. Therefore, 5 sets of target genes were predicted based on those criteria, and they were called 'overlap_target', 'nearest_target', 'neighbor_1kb_target', 'neighbor_10kb_target' and 'neighbor_100kb_target', respectively.
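The five assignment rules can be sketched as follows. This is a simplified Python illustration rather than the custom Perl script behind TFBSbank: the `Gene` record and the function names are hypothetical, all genes are assumed to lie on the same chromosome as the peak, and "closest" is taken to mean the distance from the peak center to the gene body, which is an assumption.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def assign_targets(peak_start, peak_end, genes, flanks=(1_000, 10_000, 100_000)):
    """Return the five target sets for a single peak."""
    center = (peak_start + peak_end) // 2

    def distance(gene):
        # Distance from the peak center to the gene body (0 if inside it).
        if gene.start <= center <= gene.end:
            return 0
        return min(abs(gene.start - center), abs(gene.end - center))

    targets = {
        # 'physical overlap': genes directly overlapping the peak.
        "overlap_target": [
            g.name for g in genes
            if overlaps(peak_start, peak_end, g.start, g.end)
        ],
    }

    # 'nearest gene': closest gene(s) to the center, regardless of distance.
    best = min((distance(g) for g in genes), default=None)
    targets["nearest_target"] = [g.name for g in genes if distance(g) == best]

    # 'neighbor overlap': genes overlapping center +/- 1kb, 10kb, 100kb.
    for flank in flanks:
        label = f"neighbor_{flank // 1000}kb_target"
        targets[label] = [
            g.name for g in genes
            if overlaps(center - flank, center + flank, g.start, g.end)
        ]
    return targets
```

For the example peak 'chr2L: 300,000-301,000', calling `assign_targets(300000, 301000, genes)` uses the center 300,500 and returns one list of gene names per method, keyed by the five set names given above.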

Hypergeometric tests for GO terms and KEGG pathways. Gene Ontology (GO) terms have been widely used for the consistent description of gene products (26, 27). There are 3 independent sets of GO terms: 'molecular function', 'biological process' and 'cellular component'. In our project, we are mainly interested in 'biological process', as it is the most informative for understanding the biological functions of a gene. Thus, all our analyses were restricted to 'biological process' GO terms. Standard hypergeometric tests were conducted to identify enriched GO terms among the putative targets. Specifically, the 'GOHyperGParams' method from the R package GOstats (28) was employed to identify over-represented GO terms, and the 'KEGGHyperGParams' method was used to reveal enriched KEGG pathways. After calculating the P value of each term with the hypergeometric test, we performed multiple testing correction using the Benjamini-Hochberg (BH) method to control the false discovery rate. All GO terms and KEGG pathways with an adjusted P value of less than 0.01 are regarded as significantly enriched.
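The statistics behind this step can be written out in a few lines. The sketch below is a plain Python illustration of an over-representation test followed by BH adjustment; it does not reproduce the GOstats implementation, and the function and argument names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) for a hypergeometric over-representation test.

    N: genes in the universe, K: universe genes annotated with the term,
    n: putative target genes, k: target genes annotated with the term.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted P values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_terms(term_pvalues, alpha=0.01):
    """Names of terms whose BH-adjusted P value is below alpha."""
    names = list(term_pvalues)
    adjusted = benjamini_hochberg([term_pvalues[t] for t in names])
    return [t for t, p in zip(names, adjusted) if p < alpha]
```

Here `alpha=0.01` mirrors the cutoff stated above; `term_pvalues` is assumed to be a dictionary mapping each term identifier to its raw P value.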

Co-binding analysis. To investigate potential co-binding of TFs, we extracted experimentally confirmed TFBSs from the REDfly (22) database and overlapped them with ChIP peaks. If the TFBSs of a TF directly overlapped the ChIP peaks, that TF was considered an 'overlapping TF'. The interactions between different proteins in Drosophila have been investigated through yeast two-hybrid and co-AP/MS methods, and the results are available in DroID. In our study, all proteins having protein-protein interactions (PPIs) with the investigated TF were extracted from DroID (29). TFs linked to the investigated TF by known interactions in DroID were designated 'interacting TFs'. The intersection between 'overlapping TFs' and 'interacting TFs' gives the subset of TFs that not only co-bind with the investigated TF but also share a direct protein-protein interaction with it; these TFs were considered candidates for putative cofactors/collaborators.
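The selection logic amounts to an interval overlap followed by a set intersection. The following Python sketch shows that logic under assumed input formats: TFBS records as (TF name, chromosome, start, end) tuples, peaks as (chromosome, start, end) tuples, and the PPI partners as a set of TF names. None of these formats or names come from REDfly or DroID.

```python
def overlapping_tfs(tfbs_records, peaks):
    """TFs with at least one confirmed binding site overlapping a ChIP peak."""
    found = set()
    for tf, chrom, start, end in tfbs_records:
        for peak_chrom, peak_start, peak_end in peaks:
            if chrom == peak_chrom and start <= peak_end and peak_start <= end:
                found.add(tf)
                break  # one overlapping peak is enough for this site
    return found


def putative_cofactors(tfbs_records, peaks, interacting_tfs):
    """TFs that both co-bind and have a known PPI with the investigated TF."""
    return sorted(overlapping_tfs(tfbs_records, peaks) & set(interacting_tfs))
```

The nested loop is quadratic in the number of sites and peaks, which is acceptable for a sketch; an interval tree or sorted sweep would be the usual choice for genome-scale inputs.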

Motif enrichment scanning. The presence of known motifs in TF ChIP peaks was identified using PWMEnrich with default parameters. PWMEnrich is an R package for motif scanning of DNA sequences, built upon Biostrings. As PWMEnrich only supports the analysis of human, mouse and fly, we were only able to conduct known motif analysis in those three species. The de novo motif scanning was performed using HOMER (25). Specifically, the DNA sequences within 100 bp of the peak centers were extracted from the genome and compared to genomic background sequences to reveal potentially enriched de novo motifs. The reported motifs were ranked in ascending order of their P values.
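As an illustration of the sequence extraction step only, the Python sketch below computes the window within 100 bp of a peak center and slices it from a chromosome sequence. It is not part of HOMER; the function names, the 1-based inclusive coordinates and the clipping at chromosome ends are assumptions.

```python
def center_window(peak_start, peak_end, flank=100, chrom_length=None):
    """Coordinates (1-based, inclusive) of the peak center +/- flank."""
    center = (peak_start + peak_end) // 2
    start = max(center - flank, 1)
    end = center + flank
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


def extract_sequence(chrom_seq, start, end):
    """Slice a chromosome string using 1-based inclusive coordinates."""
    return chrom_seq[start - 1:end]
```

For the example peak 'chr2L: 300,000-301,000', `center_window(300000, 301000)` gives the window from 300,400 to 300,600 around the center 300,500.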