Reliably transferring large amounts of data using rsync and pattern-matching

A simple how-to on remote data tranfers for bioinformaticians

A very common bioinformatic procedure is transferring files in different directories between computers. Often, it’s not as simple as running the exact same command each time, and there are slight modifications needed to make sure the correct files get transferred. The fundamentals however are very simple, and are generally very consistent. Here’s an explainer on how to use screen, rsync and pattern matching methods to make sure that specific files and directories get transferred reliably and efficiently.


Dependancies

If you’re primarly working on a MacOS or Linux machine, the screen and rsync utilities are most likely pre-installed. On the off chance they aren’t, they can be easily installed:

sudo apt update

sudo apt install screen rsync


Setting up a screen session

screen is a utility that lets you run several separate commands from a single terminal session. These commands will continue to run as long as the screen is active. Say you’re running running a command in a screen set up on a remote computer from your laptop at home. You could turn off your laptop, and the command will still run remotely with no issues.

We’d want to set up long-running jobs, such as large file transfers, inside a screen session if we anticipate it might take hours or days. Any interruptions that occur can cause file corruption that becomes annoying to handle. Setting up and managing screens is relatively easy, and only requires a few commands to remember:

1. Creating a screen session (this will automatically enter the screen session)

# name your screen something useful, especially so other users can know what its running
screen -S name_of_screen

2. Detaching from a screen

Control + A + D

3. Listing active screens

# this will list any active screens, their numerical ID, and whether you are attached or detached to them
screen -ls 

4. Reattaching to a screen

screen -r name_of_screen_or_screen_ID

5. Kill/Quit a screen

# Will end the screen only in that window (most common use case, typically the safer option)
screen -XS name_of_screen_or_screen_ID kill

# Will end the screen across all windows
screen -XS name_of_screen_or_screen_ID quit

Memorize these commands, which will quickly become 2nd nature, and you’re good to go.


rsync basics

rsync is a file transfer program for efficient transfers of data. I almost always only use the same 5 options, detailed below.

Option Definition
-P keeps partially transferred files (meaning you can easily restart the transfer) and shows a progress bar
-a archive, preserves many important attributes of the files
-v verbose, outputs more details on the transfer
-z compresses files during transfer
-h makes outputs like filesizes human readable

Occasionally, I will also use -r or -n:

Option Definition
-r recurse into directories, if you’re copying directories
-n dry run, if you want to see what will happen without running the transfer


The very basic syntax for a remote transfer of a file or files is as follows:

# File(s) transfer
rsync -Pavzh /source_dir/ remote_username@remote_domain_or_IP:/dest_dir/


Selective file transfer with find and wildcard expansion

Selectively copying files is immensely useful and will save you a lot of time. Here are some common use cases:

Transfer only some files from a single directory

# Use wildcard expansion to specify filenames. This will find files that match the pattern between the curly brackets and transfer them.

rsync -Pavzh {Sample1*,Sample2*,Sample3*} remote_username@remote_domain_or_IP:/dest_dir/

Transfer only certain filetypes, located in multiple directories

# The additional --no-relative and --no-dirs options effectively flattens the directory structure, putting all files into a single parent directory
# This command finds all "fastq.gz" files in multiple directories and transfers them

rsync -Pavzh `find /source_dir_1/ /source_dir_2/ -type f -name "*.fastq.gz"\` --no-relative --no-dirs remote_username@remote_domain_or_IP:/dest_dir/

Transfer files matching patterns found in different directories

rsync -Pavzh --no-relative --no-dirs `find . -type f \( -name "Sample1*" -o -name "Sample2*" -o -name "Sample3*" \)` remote_username@remote_domain_or_IP:/dest_dir/

Where you want to search for files can be changed. Note that it is specified as two different directories in the 2nd example (‘source_dir_1’ and ‘source_dir_2’) and the current working directory in the 3rd (‘.’).