BLAST can be parallelized to greatly improve runtime. This may be needed if you are BLASTing a large query sequence set against a giant database. The specific problem I was faced with was identifying contaminant sequences (of non-eukaryotic origin), accessioned in NCBI’s nt database, in a eukaryotic genome assembly stored locally. This procedure, however, can easily be modified for any use case (such as using different BLAST flavours, parameters, databases, etc.) by changing the provided script.
Make sure you have plenty of available space on your machine (several TB) before uncompressing any files from NCBI!
Large sequence databases
A problem that arises every once in a while is the need to BLAST very large query sets, such as a pre-assembly genome or transcriptome, against NCBI databases like the nt database, which, as of the time of this post, has an uncompressed size just shy of 2 TB!
BLASTing a large genome/transcriptome against this database can take days or weeks depending on your hardware. We can greatly speed up this process with GNU parallel.
Dependencies
The BLAST executables for the set of main BLAST algorithms. These can easily be installed on most Linux machines with conda or mamba (preferred nowadays). For example:
mamba install -c bioconda blast
seqkit is used to split a FASTA file into parts. It, too, can be installed with conda or mamba:
mamba install -c bioconda seqkit
GNU parallel lets you distribute jobs across different CPU cores. Install the appropriate version for your OS.
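For example, on Debian/Ubuntu it is available through the system package manager, and there is also a conda-forge package; either of the following should work:
sudo apt install parallel # Debian/Ubuntu
mamba install -c conda-forge parallel # conda/mamba alternative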
Downloading and setting up appropriate files
We’ll not only need to download the nt database for this, but also a few taxonomy files from NCBI. These will allow us to access and parse taxonomy-related information from our BLAST search. All of these taxonomy files get dumped into the nt/ directory for simplicity.
NCBI’s nucleotide database
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz # Remember, this is a huge file. This may take a while.
gunzip nt.gz # This will take up a lot of space!
mv nt nt.fasta # rename the uncompressed FASTA so its name is free for the database directory
mkdir -p nt/ # this directory will hold the BLAST database and taxonomy files
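It is also worth checking that such a large download arrived intact; NCBI publishes an MD5 checksum alongside the archive, which you can verify before uncompressing:
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz.md5
md5sum -c nt.gz.md5 # should report: nt.gz: OK (do this before gunzipping, while nt.gz still exists)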
Taxonomy files
## First, download and move the accession2taxid files (connects accessions from NT to an NCBI taxonomy ID)
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip nucl_gb.accession2taxid.gz
sed '1d' nucl_gb.accession2taxid | awk '{print $2" "$3}' > taxidmapfile # strip the header and keep only the accession.version and taxid columns (the format -taxid_map expects)
mv taxidmapfile nt/
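## Optional sanity check: each line of the map should now read "<accession.version> <taxid>"
head -n 3 nt/taxidmapfile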
## Next, download and uncompress other taxonomy files. These encode lots of taxonomy information that can be accessed later through BLAST search parameters.
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mv taxdump.tar.gz nt/
tar -xvzf nt/taxdump.tar.gz -C nt/ # extract into nt/ so the taxonomy files sit next to the database
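One caveat worth flagging: the name-based output columns used later (sscinames, scomnames, sskingdoms) are looked up from BLAST’s own taxdb files rather than from the taxdump. If those columns come back as N/A in your results, also place the taxdb archive in the database directory:
wget https://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz
tar -xvzf taxdb.tar.gz -C nt/ # provides taxdb.btd and taxdb.bti next to the database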
Making the blast database
makeblastdb -in nt.fasta -dbtype nucl -parse_seqids -taxid_map nt/taxidmapfile -out nt/nt # write the database files into nt/ with the base name "nt"
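To confirm the database was built correctly, you can ask BLAST to summarize it:
blastdbcmd -db nt/nt -info # prints the number of sequences, total bases, and the build date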
We’ve now set up the files and built a database that encodes taxonomy information for every accession. Now we’re ready to run BLAST!
Running parallel blast, including taxonomy outputs
By default, when BLAST is run locally it takes a single query sequence and performs the seeding/extension alignment against every sequence in the entire database. If we have a file with thousands of queries, like a pre-assembly genome or transcriptome, this will take a very long time. The idea behind this script is to split up and distribute the BLAST problem using seqkit split and GNU parallel. If you have the computational resources, different cores will each work on a partition of the BLAST task in parallel, speeding it up greatly.
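Stripped of all the options, the pattern looks something like this (the file names here are just placeholders):
seqkit split -i queries.fasta # one FASTA file per query sequence, written to queries.fasta.split/
ls queries.fasta.split/*.fasta | parallel -j 8 'blastn -query {} -db nt -outfmt 6' > all_hits.tsv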
Because we have also set up taxonomy information in the nt/ database for retrieval, we will be able to extract taxonomy-relevant information from our search. Below are some of the options I use when querying; a full list can be found here.
| Option | Definition |
|---|---|
| sacc | subject accession |
| staxids | unique subject taxonomy ID(s) |
| sskingdoms | unique subject super kingdom(s) |
| sscinames | unique subject scientific name(s) |
| scomnames | unique subject common name(s) |
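These keywords are simply appended to the tabular format string passed to -outfmt. For example (the query file name is illustrative):
blastn -query scaffold.fasta -db nt -outfmt "6 qseqid sacc staxids sskingdoms sscinames scomnames evalue bitscore"
Running blastn -help also prints the complete list of supported format specifiers.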
Parallel BLAST script
The actual BLAST command is near the end of the script, and its options can be modified in any way you choose.
#!/bin/bash

genome=""
num_cores=1
NT=""

display_help() {
    echo "Usage: $0 -g <genome.fasta> -n <NT_path> -p <num_cores>"
    echo "Options:"
    echo "  -g <genome.fasta>: Path to the genome FASTA file"
    echo "  -n <NT_path>: Path to the NCBI nucleotide (nt) database directory (with the taxonomy files inside as well)"
    echo "  -p <num_cores>: Number of CPU cores to use for parallelization (the default is 1)"
    echo "  -h: Display this help message"
}

while getopts ":g:n:p:h" opt; do
    case $opt in
        g)
            genome="$OPTARG"
            ;;
        n)
            NT="$OPTARG"
            ;;
        p)
            num_cores="$OPTARG"
            ;;
        h)
            display_help
            exit 0
            ;;
        :)
            echo "Error: Option -$OPTARG requires an argument."
            exit 1
            ;;
        \?)
            echo "Error: Invalid option -$OPTARG"
            exit 1
            ;;
    esac
done

if [[ -z $genome || -z $NT ]]; then
    echo "Error: Some options are missing."
    display_help
    exit 1
fi

# Export BLASTDB so blastn can find the database and taxonomy files
export BLASTDB="$NT"

# Split the genome FASTA into one file per sequence (written to ${genome}.split/)
seqkit split -i "$genome" || exit 1

# Run BLAST in parallel, one job per sequence file.
# Each job uses a single thread; the parallelism comes from GNU parallel itself.
ls "${genome}.split"/*.fasta | parallel -j "$num_cores" \
    'blastn \
    -task megablast \
    -query {} \
    -db nt \
    -outfmt "6 qseqid sacc staxids sskingdoms sscinames scomnames evalue bitscore" \
    -max_target_seqs 1 \
    -max_hsps 1 \
    -num_threads 1 \
    -evalue 1e-25' \
    > "$(basename "$genome").vs.nt.mts1.hsp1.1e25.megablast.out"

# Clean up temporary split files
rm -rf "${genome}.split"
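Assuming the script is saved as, say, parallel_blast.sh (the script name and paths below are just examples), it can be run like this:
bash parallel_blast.sh -g my_assembly.fasta -n /path/to/nt -p 16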
Output
An example of the output found in the megablast.out file is shown below. The columns correspond to the options set with the -outfmt parameter.
Scaffold_31_HRSCAF_176 MT070612 8732 Eukaryota Crotalus durissus terrificus tropical rattlesnake 0.0 6948
Scaffold_3_HRSCAF_30 OX365972 103942 Eukaryota Vipera ursinii Vipera ursinii 0.0 4106
Scaffold_24_HRSCAF_168 MT019621 8730 Eukaryota Crotalus atrox western diamondback rattlesnake 0.0 8527
Scaffold_19_HRSCAF_159 OX365970 103942 Eukaryota Vipera ursinii Vipera ursinii 0.0 8347
Scaffold_22_HRSCAF_162 OX365981 103942 Eukaryota Vipera ursinii Vipera ursinii 0.0 3099
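For a quick overview before searching by hand, you can tally the superkingdom column (column 4 of the tabular output):
cut -f4 megablast.out | sort | uniq -c # counts hits per superkingdom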
Quickly search this file (Control + F or ⌘ + F) to see whether there are any non-eukaryotic sequences in your assembly. If there are, the assembled contaminant sequences can easily be identified and removed:
awk -F'\t' '$4 != "Eukaryota" {print $1}' megablast.out | sort -u > contaminant_sequences.txt # IDs of scaffolds whose best hit is non-eukaryotic
seqkit grep -v -f contaminant_sequences.txt genome.fasta > no_contaminants_genome.fasta # drop those scaffolds; scaffolds with no hit at all are kept
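As a final check, comparing sequence counts before and after filtering shows how many scaffolds were dropped:
seqkit stats genome.fasta no_contaminants_genome.fasta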
Data compression
NCBI databases are huge, and you don’t want them to take up that much space when not in use. Daren Card’s blog post on this topic goes over many details on how to compress large directories after you’re done using them.
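As a minimal example, assuming pigz is installed, the whole database directory can be archived with multi-threaded compression (plain gzip works too, just more slowly):
tar -cf - nt/ | pigz -p 16 > nt.tar.gz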