1. Quick Start

This chapter guides you through setting up the environment and executing the automated pipeline.

Important!

In the following sections, whenever a “parameter” is shown in curly braces {}, replace it with your own filename or value. Each parameter is explained in the section where it appears.

Tip

Notice the small “Copy to Clipboard” button on the right-hand side of each code chunk; you can use it to copy the code.

1.1 Singularity Setup

The workflow is distributed as a self-contained Singularity container image, which includes all necessary software dependencies and helper scripts.

Prerequisites: Singularity version 3.x or later (or Apptainer) must be installed on your system. If you are working on a High Performance Computing (HPC) system, it is likely already installed. To check, run singularity --help in a terminal connected to the HPC system and see whether the command is recognized.
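The check described above can be scripted. The following is a minimal sketch, assuming that either a singularity or an apptainer executable on your PATH is acceptable; the function name find_container_runtime is hypothetical and not part of the workflow.

```shell
# Print the first container runtime found on PATH, or fail if neither exists.
find_container_runtime() {
  local candidate
  for candidate in singularity apptainer; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  echo "Neither singularity nor apptainer was found on PATH" >&2
  return 1
}

# Report the installed version when a runtime is available.
if runtime=$(find_container_runtime); then
  "$runtime" --version
fi
```

If nothing is found, ask your HPC administrators whether the software must first be loaded through a module system.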

1.2 Download the Image

Download the required workflow image file (imam_workflow.sif) directly through the terminal:

wget https://github.com/EMC-Viroscience/illumina-metagenomic-analysis-manual/releases/latest/download/imam_workflow.sif

Alternatively, download it manually from the GitHub page and transfer it to your HPC system.

1.3 Verify Container

You can verify the download by checking the container version or starting an interactive shell.

# Check version
singularity run imam_workflow.sif --version

# Start interactive shell
singularity shell imam_workflow.sif

Running singularity shell imam_workflow.sif drops you into a shell inside the container. The conda environment needed for this workflow is activated automatically when the interactive shell starts, so all of its tools are ready to use.

Please note that you do not have to run conda activate {environment} to activate the environment – everything is inside imam_workflow.sif. If you’re curious about the conda environment we’re using, you can check it out here

1.4 Project Setup

We use a tool called Snakemake to automate the analysis. To simplify the creation of the required project directory, the container includes a helper script called prepare_project.py. This script automates the creation of the project directory, the sample configuration file (sample.tsv), and the general settings configuration file (config.yaml), guiding you through each step with clear prompts and error checking.

Required files

Before running the setup script, ensure you have the following two files ready:

Diamond DB: A diamond database that will be used to annotate assembled contigs.
Reference genome: An indexed FASTA file that will be used as a host filter for your reads.
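Before running the setup script, it can help to confirm that both files exist and carry the extensions this manual expects (.fna, .fasta or .fa for the reference, .dmnd for the database). The sketch below is one way to do that; check_inputs is a hypothetical helper, and it does not look for index files because their format is not specified here.

```shell
# Return 0 only if both input files exist, are non-empty, and have an expected extension.
check_inputs() {
  local reference="$1"
  local database="$2"
  local status=0
  local file

  case "$reference" in
    *.fna|*.fasta|*.fa) ;;
    *) echo "Unexpected reference extension: $reference" >&2; status=1 ;;
  esac
  case "$database" in
    *.dmnd) ;;
    *) echo "Unexpected database extension: $database" >&2; status=1 ;;
  esac

  for file in "$reference" "$database"; do
    if [ ! -s "$file" ]; then
      echo "Missing or empty file: $file" >&2
      status=1
    fi
  done
  return "$status"
}
```

Call it as check_inputs {reference} {database}; an exit status of 0 means both files passed these basic checks.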

Initializing the Project Directory

Use singularity exec to run the setup script. This will create your project folder and generate the sample.tsv, config.yaml and Snakefile required for the pipeline.

singularity exec \
  --bind /mnt/viro0002-data:/mnt/viro0002-data \
  --bind "$HOME:$HOME" \
  --bind "$PWD:$PWD" \
  imam_workflow.sif \
  python /prepare_project.py \
    -p {project.folder} \
    -n {name} \
    -r {reads} \
    --ref-genome {reference} \
    --diamond-db {database} \
    -t {threads}

{name}: The name of your study, no spaces allowed.
{project.folder}: The project folder where you run your workflow and store results.
{reads}: The folder that contains your raw .fastq.gz files. Raw read files must adhere to the naming scheme as described here.
{reference}: Absolute path pointing to your reference genome (.fna, .fasta, .fa).
{database}: Absolute path pointing to your diamond database (.dmnd).
{threads}: The default number of threads to use; this value is stored in config.yaml.
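Two of the constraints listed above are easy to get wrong: the study name must not contain spaces, and the reference and database must be given as absolute paths. The following sketch checks both before you call the setup script; check_setup_args is a hypothetical helper, not something shipped in the container.

```shell
# Return 0 only if the study name has no spaces and both paths are absolute.
check_setup_args() {
  local name="$1"
  local reference="$2"
  local database="$3"
  local status=0
  local file

  case "$name" in
    ''|*" "*) echo "Study name must be non-empty and contain no spaces: '$name'" >&2; status=1 ;;
  esac

  for file in "$reference" "$database"; do
    case "$file" in
      /*) ;;
      *) echo "Not an absolute path: $file" >&2; status=1 ;;
    esac
  done
  return "$status"
}
```

For example, check_setup_args my_study /data/host.fa /data/proteins.dmnd succeeds, while a name such as "my study" or a relative path such as host.fa is rejected. The values shown are placeholders.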

Important

The --bind arguments are needed to explicitly tell Singularity to mount the necessary host directories into the container. The part before the colon is the path on the host machine that you want to make available. The path after the colon is the path inside the container where the host directory should be mounted.

By default, Singularity usually binds your home directory ($HOME) and the current directory ($PWD) automatically. This example also binds /mnt/viro0002-data explicitly. If your input files (reads, reference, databases) or output project directory reside outside these locations, you must add a --bind /host/path:/container/path option for each such location; otherwise the container will not be able to find them.
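When inputs are spread over several locations, the --bind options can be derived from the paths themselves. The sketch below is one possible approach: it prints a --bind option for the parent directory of every path it is given, mounted at the same location inside the container, and removes duplicates. The helper name bind_args is hypothetical.

```shell
# Print one "--bind DIR:DIR" option per unique parent directory of the given paths.
# Binding the parent directory also makes the path itself visible in the container.
bind_args() {
  local item
  for item in "$@"; do
    dirname "$item"
  done | sort -u | while IFS= read -r dir; do
    printf '%s %s:%s\n' --bind "$dir" "$dir"
  done
}
```

The output can be pasted into the singularity exec command, or inserted with an unquoted $(bind_args ...) substitution. The substitution form relies on word splitting, so it only works for paths without spaces.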

After running the prepare_project.py helper script, you should have the following files in your project directory:

The sample.tsv should have 3 columns: sample (sample name), fq1 and fq2 (paths to raw read files). Please note that samples sequenced on Illumina machines can be run across different lanes. In such cases, the Illumina software generates multiple lane-specific fastq files for each sample (e.g. L001 = Lane 1). You may therefore end up with a sample.tsv file that contains entries like 1_S1_L001 and 1_S1_L002, even though these are the same sample sequenced across different lanes. The Snakemake workflow recognizes this and merges these files accordingly.
The config.yaml contains more general information, such as the indexed reference and database you supplied and the default number of threads to use.
The Snakefile is the “recipe” for the workflow, describing all the steps we have done by hand. It is most commonly placed in the root directory of your project (you can open the Snakefile with a text editor and have a look).
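A quick sanity check of the generated sample.tsv can catch problems early. The sketch below assumes the file is tab-separated with exactly three fields per line, as described above. The strip_lane function only illustrates how lane-specific names such as 1_S1_L001 and 1_S1_L002 map to one sample; the rule the workflow itself uses to merge lanes is not shown in this chapter, and both function names are hypothetical.

```shell
# Fail if any line of the sample sheet does not have exactly three tab-separated fields.
check_sample_sheet() {
  awk -F '\t' 'NF != 3 { printf "line %d has %d fields\n", NR, NF; bad = 1 } END { exit bad }' "$1"
}

# Strip a trailing lane tag such as _L001 to show which entries belong to the same sample.
strip_lane() {
  printf '%s\n' "$1" | sed -E 's/_L[0-9]{3}$//'
}
```

Running check_sample_sheet sample.tsv prints the offending line numbers and returns a non-zero status if a row is malformed.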

1.5 Executing the Pipeline

Once the project directory is initialized, navigate into it and run the workflow.

Navigate to the project:

cd {project.folder}

This folder should contain the Snakefile, sample.tsv and config.yaml files that were generated in section 1.4.

Dry Run (Optional but Recommended): Check for errors without executing commands.

singularity exec \
  --bind /mnt/viro0002-data:/mnt/viro0002-data \
  --bind "$HOME:$HOME" \
  --bind "$PWD:$PWD" \
  imam_workflow.sif \
  snakemake --snakefile Snakefile --cores 1 --dryrun

Run the workflow: Remove --dryrun and set --cores to the number of threads you want to use.

singularity exec \
  --bind /mnt/viro0002-data:/mnt/viro0002-data \
  --bind "$HOME:$HOME" \
  --bind "$PWD:$PWD" \
  imam_workflow.sif \
  snakemake --snakefile Snakefile --cores {threads}
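The two steps above can be chained so that the real run only starts when the dry run succeeds. The following is a minimal sketch, not part of the workflow itself. The function name run_pipeline and the CONTAINER_CMD variable are hypothetical; CONTAINER_CMD exists only so the logic can be tried with a different executable, such as apptainer. The image path is passed as an argument because the bare name imam_workflow.sif only resolves if the image is in the current directory. Only $HOME and $PWD are bound here, so add further --bind options if your data lives elsewhere.

```shell
# Run a Snakemake dry run first and start the real run only if it succeeds.
# Usage: run_pipeline THREADS [IMAGE]
run_pipeline() {
  local threads="$1"
  local image="${2:-imam_workflow.sif}"
  local runner="${CONTAINER_CMD:-singularity}"

  # Step 1: dry run, which checks the workflow without executing any jobs.
  "$runner" exec --bind "$HOME:$HOME" --bind "$PWD:$PWD" "$image" \
    snakemake --snakefile Snakefile --cores 1 --dryrun || {
      echo "Dry run failed; not starting the workflow" >&2
      return 1
    }

  # Step 2: real run with the requested number of cores.
  "$runner" exec --bind "$HOME:$HOME" --bind "$PWD:$PWD" "$image" \
    snakemake --snakefile Snakefile --cores "$threads"
}
```

From inside the project folder, run_pipeline 8 /path/to/imam_workflow.sif would perform the dry run and then the full run with 8 cores; both values are placeholders.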