Visualizing read alignment data with ggplot

Methods to make publication-ready genomic coverage figures in R

Read coverage plots are a readily interpretable way to visualize genomic or epigenomic profiles (RNA-seq, ChIP-seq, ATAC-seq, WGS, etc.) across many samples mapped to the same reference. See some examples here, here, and here. A common tool to visualize genomic data in this manner is IGV, which while versatile, often can be challenging to customize for publication-ready figures. Here is a tutorial on using ggplot and R to have much more artistic control over genomic coverage figures. The structure of intermediate objects will be shown so they can be easily replicated with custom data.

Disclaimer: this is not a ggplot or dplyr tutorial, but rather an example of an application of these libraries for visualizating quantitative genomic mapping data.

The R script used in this tutorial can be found here: genome-coverage-minimal-example.R.


R packages

There are just two required packages for this tutorial: ggcoverage, tidyverse.

If you want to do a bit more figure customization: scales, gggenes and RColorBrewer are used in the accompanying script.


Making coverage plots

ggcoverage is a package that has its own genomic coverage visualization implementation, but you’ll get more flexibility with ggplot. It does however have a nice function, used in this guided example, to import many commonly used alignment data formats (bam, wig, bw, bedgraph, txt) efficiently, without putting too much stress on memory requirements, meaning it can be run locally quite easily.

I’m using data from a recent publication representing chromatin accessibility (ATAC-seq) data that was mapped to a common reference genome, producing genomic alignment files (bam format).


1. Set up metadata, and load in coverage files

The first step is to configure a metadata object to let ggcoverage know where to look for data. You can include any other useful data attributes that you want to represent in the visualization such as for grouping or colouring.

# this is a folder with .bam files (multi-sample ATAC-seq mapped to common reference)
track.folder = 'path/to/bams/' 

# Create metadata from .bam names
# sample.meta = data.frame(SampleName = gsub('.bam', '' , grep('bam$', list.files(track.folder), value = T)))
# sample.meta is just a 1 column df with basenames of bam files (no .bam extension), equivalent to:

sample.meta <- data.frame(
  stringsAsFactors = FALSE,
        SampleName = c("CV0857_viridis_North_M", "CV0985_concolor_Other_F",
                       "CV1081_viridis_Mid_M", "CV1082_viridis_South_M", 
                       "CV1086_viridis_South_M", "CV1087_viridis_North_F",
                       "CV1089_viridis_South_M", "CV1090_cerberus_Other_M", 
                       "CV1095_viridis_North_M", "CV1096_viridis_North_F")
)

# Add more attributes to the metadata
sample.meta <- sample.meta %>% 
  mutate(SampleID = str_extract(SampleName, "^[^_]+")) %>% 
  mutate(Species = str_extract(SampleName, 'viridis|concolor|cerberus')) %>% 
  arrange(match(Species, c('viridis', 'concolor', 'cerberus')))

The basic format for the metadata, sample.meta, should be like shown below. The first column needs to be the file name without the extension, then you can add more attributes as additional columns

SampleName SampleID Species
CV0857_viridis_North_M CV0857 viridis
CV1081_viridis_Mid_M CV1081 viridis
CV1082_viridis_South_M CV1082 viridis
CV1086_viridis_South_M CV1086 viridis
CV1087_viridis_North_F CV1087 viridis
CV1089_viridis_South_M CV1089 viridis
CV1095_viridis_North_M CV1095 viridis
CV0985_concolor_Other_F CV0985 concolor
CV1090_cerberus_Other_M CV1090 cerberus


2. Load regions of interest from alignment files

ggcoverage requires minor tweaks to arguments in the LoadTrackFile function depending on the input format. I’d use ?LoadTrackFile to understand how to read your in data correctly. I’ve shown an example with comments specifying differences for .bam and .bw filetypes.

In this example, we’re going to be focusing on a regulatory region of a gene of interest.

# load regions of interest from bam files (change this for custom data)
chrom = "scaffold-mi2"
start_pos = floor(8923875 / 100 ) * 100 # Rounds the start coordinate to the nearest 100
end_pos = ceiling(8924426 / 500 ) * 500 # Rounds the end coordinate to the nearest 500

track.df = LoadTrackFile(track.folder = track.folder, format = "bam", # change between bam/bw
                         bamcoverage.path = '/Users/sidgopalan/miniconda3/bin/bamCoverage', 
                         meta.info = sample.meta,
                         single.nuc = T, # change to T for BAM, F for BigWig
                         single.nuc.region = paste0(chrom, ':', start_pos, '-', end_pos) # bam, if single.nuc = T
                         #region = paste0(chrom, ':', start_pos, '-', end_pos) # bw
)

The resulting track.df object will provide scores (RPKM normalized by default, this can be changed) representing levels of coverage across genomic coordinates, either nucleotide-by-nucleotide or by larger bins depending on the options used in LoadTrackFile. Note that the attributes we added to sample.meta, SampleID and Species, are also included in this object. This will allow us to group and colour our plot. You may notice that this dataframe is already in tidy format, ideal for input to ggplot. See below for a snapshot of track.df:

seqnames start end score SampleID Species
scaffold-mi2 8924084 8924085 93 CV0857 viridis
scaffold-mi2 8924085 8924086 95 CV0857 viridis
scaffold-mi2 8924086 8924087 92 CV0857 viridis
scaffold-mi2 8924087 8924088 93 CV0857 viridis
scaffold-mi2 8924088 8924089 91 CV0857 viridis
scaffold-mi2 8924089 8924090 90 CV0857 viridis
scaffold-mi2 8924090 8924091 87 CV0857 viridis
scaffold-mi2 8924091 8924092 77 CV0857 viridis
scaffold-mi2 8924092 8924093 77 CV0857 viridis
scaffold-mi2 8924093 8924094 77 CV0857 viridis


3. Making a basic coverage plot

First is a demonstration of the most basic plot, then we can increase its complexity by building off the starter code.

The data used here represents ATAC-seq data from ten individuals of three species (viridis, concolor and cerberus). So for these plots, we want to split up coverage tracks to show each individual, and colour by species. This process involves “facet wrapping”. Here, we facet wrap by SampleID, then set fill=Species. We use geom_col because we want heights to represent data values.

# Basic plot
ggplot() +
  theme_minimal() +
  geom_col(data = track.df, aes(x = start, y = score, fill = Species)) +
  facet_wrap(~factor(SampleID, levels = sample.meta$SampleID)) +
  labs(x = "Position on scaffold-mi2", y = "ATAC-seq read density") +
  ggtitle("Regulatory region")

center-aligned-image

Not quite publication quality, so we can make a few more customizations.


4. More cutomization with ggplot

ggplot allows a great deal of customization and complexity. For example, we can add the following:

  • Single column facet
  • Changed colours, axes labels and legend (requires scales)
ggplot() +
  theme_minimal() +
  geom_col(data = track.df, aes(x = start, y = score, fill = Species)) +
  facet_wrap(~factor(SampleID, levels = sample.meta$SampleID), ncol = 1, strip.position = "right") + # can use scales = "free_y" if you want to have each track to have independently determined y axes
  scale_x_continuous(limits = c(start_pos,end_pos), expand = c(0,0), labels = label_number(scale = 1e-6), breaks = pretty_breaks(n=3)) +
  scale_y_continuous(breaks = pretty_breaks(n=2)) +
  theme(legend.position = "bottom", # the following 3 lines customize the legend
        legend.direction = "horizontal",
        plot.margin = margin(10,5,5,5),
        ) +
  guides(fill = guide_legend(nrow = 1, title = NULL)) + 
  scale_fill_manual(values = c("viridis" = "#2E8B58", # the following 7 lines customize the fill legend
                               "concolor" = "#D9A528",
                               "cerberus" = "black"),
                    labels = c(expression(italic("C. viridis")),
                               expression(italic("C. o. concolor")),
                               expression(italic("C. o. cerberus"))),
                    breaks = c("viridis", "concolor", "cerberus")) +
  labs(x = paste0("pos. on ", chrom, ' (Mb)'), y = "ATAC-seq read density") +
  ggtitle("Regulatory region")
center-aligned-image
Slightly more customized plot.


5. Adding annotations to ggplot

You may want to add data annotation to your plot, representing other biological/genomic information. Adding annotations to ggplot is easy; it’s just a matter of creating new data frames with the information you want to plot, then setting the appropriate x, y, and potentially fill and color values in the aes() calls to the various geoms.

Lets add the following (this data is made up for the sake of illustration):

  • A translucent gray region representing a peak region (such as those called by peak-calling algorithms)
  • An arrow representing the gene locus (uses gggenes)
  • Transcription factors binding sites in each individual (coloured with RColorBrewer)
  • A Hi-C contact curve
# ATAC-seq peak
peaks <-
  data.frame(
    start = c(8923901L),
    end = c(8924400L)
  )

# Gene locus
arrow <- data.frame(
  stringsAsFactors = FALSE,
             start = c(8924000L),
               end = c(8924410L),
          SampleID = c("CV0857") # We want SampleID here so we plot this on the top-most sample
)


# Random transcription factor binding sites (TFBSs) data
TFBSs <- data.frame(
  pos = sample(8923901:8924400, 30, replace = TRUE),
  TF = sample(LETTERS[1:5], 30, replace = TRUE),
  SampleID = sample(sample.meta$SampleID, 30, replace = TRUE)
)

# Hi-C contact
HiC <- data.frame(x1 = 8924000, x2 = 8924410, y1 = 0, y2 = 0, SampleID = "CV1090")

ggplot() +
  theme_minimal() +
  geom_col(data = track.df, aes(x = start, y = score, fill = Species)) +
  annotate(geom = "rect",
           xmin = peaks$start,
           xmax = peaks$end,
           ymin = -Inf,
           ymax = +Inf,
           alpha = 0.2) +
  geom_curve(data = HiC, aes(x = x1, y = y1, xend = x2, yend = y2), curvature = -0.2, color = "darkorchid3") + # make a curve showing a Hi-C contact
  geom_point(data = TFBSs, aes(x = pos, y = 100, color = TF)) + # Add coloured dots representing TFBSs
  geom_gene_arrow(data = arrow, aes(xmin = start, xmax = end, y = 180, fill = 'blue'), arrowhead_height = grid::unit(2, "mm"), arrow_body_height = grid::unit(1, "mm")) + # customize the gene arrow
  facet_wrap(~factor(SampleID, levels = sample.meta$SampleID), ncol = 1, strip.position = "right") + # can use scales = "free_y" if you want to have each track to have independently determined y axes
  scale_x_continuous(limits = c(start_pos,end_pos), expand = c(0,0), labels = label_number(scale = 1e-6), breaks = pretty_breaks(n=3)) + # use pretty_breaks to customize breakpoints in the axes
  scale_y_continuous(breaks = pretty_breaks(n=2)) +
  theme(legend.position = "bottom", # the following 4 lines customize the legend
        legend.direction = "horizontal",
        legend.box = "vertical", # stack the 2 legends vertically
        plot.margin = margin(10,5,5,5),
  ) +
  guides(fill = guide_legend(nrow = 1, title = NULL), # remove legend titles
         color = guide_legend(nrow = 1, title = NULL)) + 
  scale_fill_manual(values = c("viridis" = "#2E8B58", # the following 7 lines customize the fill legend
                               "concolor" = "#D9A528",
                               "cerberus" = "black"),
                    labels = c(expression(italic("C. viridis")),
                               expression(italic("C. o. concolor")),
                               expression(italic("C. o. cerberus"))),
                    breaks = c("viridis", "concolor", "cerberus")) +
  scale_color_manual(values = brewer.pal(5, "Dark2")) + # add custom colours for the TFBSs
  labs(x = paste0("pos. on ", chrom, ' (Mb)'), y = "ATAC-seq read density") +
  ggtitle("Regulatory region")
center-aligned-image
Even more customized plot, with four types of annotated data.


Additional steps

Ultimately ggplot offers far more customization potential than IGV, making it more useful for making publication-ready figures.

Because the track.df object imported by ggcoverage::LoadTrackFile is already tidy formatted, this makes it incredibly easy to augment and manipulate with dplyr. For instance, you can calculate coverage statistics across some or all samples, and plot that in addition to coverage by creating an additional variable to facet by. You can also add other plots using cowplot, such as expression profiles, right next to coverage plots. See a demonstration of these figures here.

There are limits to ggplot, and there may be the need to touch up figures in other vector graphics tools such as Adobe Illustrator or InkScape.