v0.7.1

1- Introduction

Profile Explorer for Roadmap Epigenomics data is a web application to browse some results obtained from the Roadmap Epigenomics consortium. We are not affiliated with the Roadmap Epigenomics consortium, and this analysis is not endorsed by Roadmap Epigenomics.

Many resources allow exploration of Roadmap Epigenomics results one region at a time, such as the genome browser available through the Roadmap Epigenomics data portal. Through this app we propose two other complementary analyses:

  • Profiles of epigenetic marks near a list of genome features, one assay and cell type at a time.
  • Correlation between changes in the deposition of an epigenetic mark and changes in transcription level across cell types.
Genome features available for now are:
  • Transcription Start Site (TSS), defined as the middle TSS for genes with several TSSs, using Gencode v29.
  • Transcription Termination Site (TTS) (a.k.a. Transcription End Site), defined as the middle TTS for genes with several TTSs, using Gencode v29.
  • Middle exons, defined as protein coding gene exons that are neither first nor last exons of the shortest of the gene transcripts, taken from Gencode v29.
RNA-seq data from the corresponding samples were processed using Salmon to obtain gene level TPM (Transcripts Per Million; at the gene level, the sum of the TPMs of the transcripts of said gene), exon level TPM (the sum of the TPMs of the transcripts containing said exon), or exon inclusion ratio (exon level TPM divided by gene level TPM, a.k.a. Psi).
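
The three quantities above can be illustrated with a minimal Python sketch. It is not the code used by the application: the function name `summarise_tpm`, the input dictionaries and the toy values are hypothetical, and returning `None` for exons of unexpressed genes is an assumption made here to avoid a division by zero.

```python
from collections import defaultdict


def summarise_tpm(transcript_tpm, transcript_gene, exon_transcripts):
    """Derive gene level TPM, exon level TPM and exon inclusion ratio (Psi).

    transcript_tpm:   {transcript_id: TPM}
    transcript_gene:  {transcript_id: gene_id}
    exon_transcripts: {exon_id: (gene_id, [transcript_ids containing the exon])}
    """
    # Gene level TPM: sum of the TPMs of the transcripts of the gene.
    gene_tpm = defaultdict(float)
    for transcript, tpm in transcript_tpm.items():
        gene_tpm[transcript_gene[transcript]] += tpm

    exon_tpm = {}
    psi = {}
    for exon, (gene, transcripts) in exon_transcripts.items():
        # Exon level TPM: sum of the TPMs of the transcripts containing the exon.
        exon_tpm[exon] = sum(transcript_tpm[t] for t in transcripts)
        # Psi: exon level TPM divided by gene level TPM.
        # Assumption: Psi is left undefined (None) when the gene is not expressed.
        total = gene_tpm[gene]
        psi[exon] = exon_tpm[exon] / total if total > 0 else None
    return dict(gene_tpm), exon_tpm, psi


# Toy example with made-up identifiers and values.
transcript_tpm = {"tx1": 6.0, "tx2": 2.0, "tx3": 0.0}
transcript_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
exon_transcripts = {
    "exon1": ("geneA", ["tx1"]),
    "exon2": ("geneA", ["tx1", "tx2"]),
    "exon3": ("geneB", ["tx3"]),
}
gene_tpm, exon_tpm, psi = summarise_tpm(transcript_tpm, transcript_gene, exon_transcripts)
```

In this toy example, an exon present in every expressed transcript of its gene gets an inclusion ratio of 1, and an exon present only in some transcripts gets a lower ratio.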

2- Data origin and processing

Scripts corresponding to the steps described below can be found in this repository: github.com/gdevailly/perepigenomicsAnalysis

2.1- Roadmap Epigenomics data

For WGBS, we downloaded bigwig files of fractional methylation and read coverage from the Roadmap Epigenomics data portal. Histone and DNAse1 data were downloaded as consolidated, not subsampled, tagAlign files from the data portal. The Gencode human annotation version 29 (main annotation file) was downloaded from the Gencode website as a gff3 file. Reads from RNA-seq data were retrieved from the European Nucleotide Archive using this table as a reference.

2.2- Data processing

RNA sequencing analysis

RNA-sequencing reads were quantified using Salmon v12.0 in pseudoalignment mode on the human reference genome hg38 and annotations v29 provided by Gencode. The arguments validateMappings, seqBias and gcBias were on, with biasSpeedSamp set to 5 and libType set to A. For samples with biological replicates, the median expression value in TPM across replicates was used for genes and transcripts. For each exon, the exon inclusion ratio was computed as the sum of the TPMs of the transcripts including the exon divided by the TPM value of the gene.
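
As an illustration of the replicate handling described above, here is a minimal Python sketch. The function name `median_across_replicates` and the toy values are hypothetical, and the sketch assumes that every replicate reports the same set of genes or transcripts; it is not the script used in the actual analysis.

```python
from statistics import median


def median_across_replicates(replicate_tpms):
    """Collapse biological replicates into one TPM value per feature.

    replicate_tpms: list of {feature_id: TPM} dictionaries, one per replicate.
    Assumption: all replicates contain the same feature identifiers.
    """
    features = replicate_tpms[0].keys()
    return {
        feature: median(replicate[feature] for replicate in replicate_tpms)
        for feature in features
    }


# Toy example: three replicates of the same sample, with made-up values.
replicates = [
    {"geneA": 10.0, "geneB": 0.0},
    {"geneA": 14.0, "geneB": 2.0},
    {"geneA": 11.0, "geneB": 1.0},
]
median_tpm = median_across_replicates(replicates)
```

With only two replicates, the median is simply the mean of the two values.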

Epigenetic mark processing

For WGBS, three different tracks were generated from the FractionalMethylation.bigwig files using bedtools and rtracklayer: number of CpG sites per window, mean DNA methylation ratio per window, and density of mCpG sites per window, using windows with a width of 250 base pairs, sliding by 100 base pairs. The WGBS coverage file was processed similarly to produce a fourth track serving as a control. DNAse1 and histone tagAlign files were imported as is into R.
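
The windowing scheme can be sketched in plain Python as follows. This is only an illustration: the actual processing used bedtools and rtracklayer, the function name `sliding_window_tracks` and its inputs are hypothetical, and the definition of mCpG density used here (sum of the methylation ratios in the window divided by the window width) is an assumption, since the exact formula is not given above.

```python
def sliding_window_tracks(cpgs, region_start, region_end, width=250, step=100):
    """Compute per-window WGBS summaries over one region of one chromosome.

    cpgs: list of (position, methylation_ratio) tuples, ratio between 0 and 1.
    Windows are half-open intervals [start, start + width), sliding by `step`.
    """
    tracks = []
    for start in range(region_start, region_end - width + 1, step):
        end = start + width
        ratios = [ratio for position, ratio in cpgs if start <= position < end]
        n_cpg = len(ratios)
        tracks.append({
            "start": start,
            "end": end,
            # Track 1: number of CpG sites in the window.
            "n_cpg": n_cpg,
            # Track 2: mean methylation ratio (undefined without any CpG).
            "mean_methylation": sum(ratios) / n_cpg if n_cpg else None,
            # Track 3 (assumed definition): methylated CpGs per base pair.
            "mcpg_density": sum(ratios) / width,
        })
    return tracks


# Toy example: four CpG sites in a 450 bp region, with made-up ratios.
cpgs = [(10, 1.0), (120, 0.5), (260, 0.0), (340, 1.0)]
windows = sliding_window_tracks(cpgs, region_start=0, region_end=450)
```

Because the step (100 bp) is smaller than the width (250 bp), consecutive windows overlap and a given CpG site contributes to several windows.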

3- Using PEREpigenomics

The PEREpigenomics app is divided into 5 tabs:

Explore

The 'Explore' tab allows browsing and downloading many thousands of plots linking epigenetic marks and gene expression features. It offers the following options:

  • '1- Order by:' lets users browse the dataset by selecting the epigenetic assay before or after selecting the cell type of interest. It swaps the order of option 3 and option 4, but does not change the list of available plots.
  • '2- Focus on:' lets users focus on 3 different gene features, corresponding to 4 different orderings. 'Transcription Start Sites (by gene TPM)' centers the plots on gene starts, and sorts genes from high expression on top to no expression on the bottom. 'Transcription Termination Sites (by gene TPM)' centers the plots on gene ends, and sorts genes from high expression on top to no expression on the bottom. 'Middle exons (by exon expression)' centers the plots on the starts of protein coding genes' middle exons (neither first nor last exons), with highly expressed exons on top and unexpressed exons on the bottom. 'Middle exons (by inclusion ratio)' centers the plots on the starts of protein coding genes' middle exons, with included exons on top and excluded exons on the bottom.
  • '3- Choose an assay' allows the selection of the epigenetic mark of interest.
  • '4- Choose a cell type' allows the selection of the cell type of interest. Option 4 shows only the data available according to the choice made at option 3.
  • '5- Choose a gene category' is shown only if the feature selected in option 2 is either 'Transcription Start Sites' or 'Transcription Termination Sites'. It restricts the plot to genes belonging to a defined category: main Gencode gene types, as well as short (<1 kb), long (>3 kb) and intermediate size genes.

The plot can be read as follows:

Compare

The 'Compare' tab displays two different plots at the same time, a surprisingly powerful way to explore this dataset. The 'Plot 1' panel controls the plot on the left of the screen, and the 'Plot 2' panel controls the plot on the right of the screen. This tab is best viewed on a large screen. Note that the control panels in this tab can be moved by holding the mouse button.

Correlate

While the 'Explore' and 'Compare' tabs display all the genes within one cell type, the 'Correlate' tab takes the complementary perspective of comparing the same gene across cell types. The 'Correlate' tab is subdivided into two sub-tabs.

The 'Gene by Gene' sub-tab displays scatter plots of the gene expression levels (respectively exon expression levels or exon inclusion ratios) vs the amount of epigenetic marks at TSS, TTS or middle exon starts. The first option, '1- Summerise mark at:', controls the window over which epigenetic marks are summarised: TSS (±500 bp), TTS (±500 bp), middle exons (±100 bp). Middle exons can be sorted by expression level or inclusion ratio. The second option allows the choice of the epigenetic mark of interest. In the third option, '3- Search for a gene', users search for their gene of interest, either by its HUGO gene symbol or by its Ensembl identifier. A fourth option will then appear with the different genes matching the search terms. Exons are searched in the same way as genes, with the exact exon then identified by its genomic coordinates (in hg19). Once the gene or exon is selected, two interactive scatter plots will appear, one for the epigenetic mark of interest, the second for the matching control assay (WGBS coverage for WGBS data, Input for DNAse1 and ChIP-seq assays). Linear regression statistics for the epigenetic mark of interest are displayed below the plot. Pop-up information displays the cell code corresponding to each cell type, as defined by the Roadmap Epigenomics consortium.

The 'Accross all genes' sub-tab displays the distribution of the regression slopes according to the linear regression R² for all the genes. The first option, '1- Summerise mark at:', controls the window over which epigenetic marks are summarised: TSS (±500 bp), TTS (±500 bp), middle exons (±100 bp). Middle exons can be sorted by expression level or inclusion ratio. The second option allows the choice of the epigenetic mark of interest. The third option restricts the analysis to genes belonging to a defined category: main Gencode gene types, as well as short (<1 kb), long (>3 kb) and intermediate size genes. This option is not present for middle exons, as all middle exons here come from protein coding genes.
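
The per-gene statistics shown in the 'Correlate' tab can be approximated by an ordinary least squares fit across cell types. The sketch below is an illustration only: the function name `linear_fit` and the toy values are hypothetical, and regressing expression on the mark signal (rather than the reverse), without any transformation of the values, is an assumption and not a description of the application's internals.

```python
def linear_fit(x, y):
    """Ordinary least squares fit of y = slope * x + intercept.

    x: epigenetic mark signal for one gene, one value per cell type (assumed).
    y: expression of the same gene in the same cell types (assumed).
    Returns (slope, intercept, r_squared).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has no variance, the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R² of a simple linear regression is the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


# Toy example: one gene measured in four cell types, with made-up values.
mark_signal = [1.0, 2.0, 3.0, 4.0]
expression = [2.0, 4.1, 5.9, 8.0]
slope, intercept, r_squared = linear_fit(mark_signal, expression)
```

Repeating such a fit for every gene gives one slope and one R² per gene, which is the kind of summary that the second sub-tab displays as a distribution.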

About

The 'About' tab contains information and instructions concerning the PEREpigenomics application.

Available profiles

This table, which is fully searchable and downloadable, lists the profiles available in the 'Explore' and 'Compare' tabs.

4- Code availability

Source code of this application is available on the INRAE GitLab instance, along with processed data. Scripts used to download and process Roadmap Epigenomics data are available on GitHub.

5- How to cite?

A pre-print describing this application will soon be available on bioRxiv.

6- Thanks

This project was funded by:

This work was made possible thanks to the help of Anna Mantsoki, Barry Horne, and Kjell Petersen.