Posts
Visualizing read alignment data with ggplot
Read coverage plots are a readily interpretable way to visualize genomic or epigenomic profiles (RNA-seq, ChIP-seq, ATAC-seq, WGS, etc.) across many samples mapped to the same reference. See some examples here, here, and here. A common tool for visualizing genomic data this way is IGV, which, while versatile, can be challenging to customize for publication-ready figures. This tutorial uses R and ggplot to give you much more artistic control over genomic coverage figures. The structure of intermediate objects is shown so they can be easily replicated with custom data.
Reliably transferring large amounts of data using rsync and pattern-matching
A very common bioinformatic procedure is transferring files in different directories between computers. Often, it’s not as simple as running the exact same command each time, and slight modifications are needed to make sure the correct files get transferred. The fundamentals, however, are simple and consistent. Here’s an explainer on how to use screen, rsync, and pattern-matching methods to make sure that specific files and directories get transferred reliably and efficiently.
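As a rough illustration of the pattern-matching idea, here is a minimal Python sketch that assembles an rsync command copying only files that match given glob patterns. It is not taken from the post itself: the function name `build_rsync_command`, the paths, and the host `remote.example.org` are hypothetical, and the command is only printed, not executed. The ordering relies on rsync applying the first filter rule that matches, so directories and wanted files are included before everything else is excluded.

```python
import shlex


def build_rsync_command(source, destination, patterns):
    """Return an rsync argument list that copies only files matching `patterns`."""
    cmd = [
        "rsync",
        "-av",                  # archive mode, verbose output
        "--partial",            # keep partially transferred files so a rerun can resume
        "--prune-empty-dirs",   # do not create directories that end up empty
        "--include=*/",         # descend into every directory
    ]
    # Include rules must come before the catch-all exclude rule.
    cmd += [f"--include={pattern}" for pattern in patterns]
    cmd += ["--exclude=*", source, destination]
    return cmd


# Hypothetical source directory and remote destination, for illustration only.
cmd = build_rsync_command(
    "data/",
    "user@remote.example.org:/backup/data/",
    ["*.bam", "*.bai"],
)
print(shlex.join(cmd))
```

Running the printed command inside a screen session would let a long transfer continue after the terminal disconnects.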
Speeding up local BLAST using GNU parallel
BLAST can be parallelized to greatly improve runtime. This may be needed if you are BLASTing a large query sequence set against a giant database. The specific problem I faced was identifying contaminant sequences (of non-eukaryotic origin), accessioned in NCBI’s nt database, in a eukaryotic genome assembly stored locally. This procedure, however, can be easily modified for any use case (such as using different BLAST flavours, parameters, databases, etc.) by changing the provided script.
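The core idea behind parallelizing BLAST is splitting the query set into independent chunks that can each be searched separately. The Python sketch below shows only that splitting step, under stated assumptions: it is not the post's script (which uses GNU parallel), and the helper names `read_fasta` and `chunk_records` and the example sequences are hypothetical.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def chunk_records(records, chunk_size):
    """Group records into lists of at most `chunk_size` items."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Made-up example input, for illustration only.
example = """>seq1
ACGT
ACGT
>seq2
GGCC
>seq3
TTAA
""".splitlines()

chunks = list(chunk_records(read_fasta(example), 2))
for index, chunk in enumerate(chunks):
    print(index, [header for header, _ in chunk])
```

Each chunk could then be written to its own file and searched as a separate job, so the searches run concurrently instead of one after another.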