1. Quick Start
This chapter guides you through setting up the environment and executing the automated pipeline.
Throughout the following sections, whenever a parameter in curly braces {} is shown, replace it with your own filename or value. Each parameter is explained in detail in its section.
Notice the small “Copy to Clipboard” button on the right-hand side of each code chunk; it can be used to copy the code.
1.1 Singularity Setup
The workflow is distributed as a self-contained Singularity container image, which includes all necessary software dependencies and helper scripts.
Prerequisites: Singularity/Apptainer version 3.x or later must be installed on your system. If you are working on a High Performance Computing (HPC) system, it is likely already installed. Try running singularity --help in a terminal connected to the HPC system to see whether the command is recognized.
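As a quick check, a small shell snippet like the following reports which container runtime is available. This is a sketch: newer HPC installations often ship Apptainer, which may answer to either name.

```shell
# Report which container runtime is available, if any.
if command -v singularity >/dev/null 2>&1; then
  singularity --version
elif command -v apptainer >/dev/null 2>&1; then
  apptainer --version
else
  echo "Neither singularity nor apptainer found in PATH" >&2
fi
```

If neither command is found, ask your HPC administrators whether the module needs to be loaded first (e.g., via a module system).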
1.2 Download the Image
Download the required workflow image file (naam_workflow.sif) directly through the terminal:
wget https://github.com/EMC-Viroscience/nanopore-amplicon-analysis-manual/releases/latest/download/naam_workflow.sif

Alternatively, go to the GitHub page, download the file manually, and transfer it to your HPC system.
1.3 Verify Container
You can verify the download by checking the container version or starting an interactive shell.
# Check version
singularity run naam_workflow.sif --version
# Start interactive shell
singularity shell naam_workflow.sif

Running singularity shell naam_workflow.sif drops you into a shell inside the container. The conda environment needed for this workflow is activated automatically when the interactive shell starts, so all of its tools are ready to use.
Please note that you do not need to run conda activate {environment} to activate the environment; everything is bundled inside naam_workflow.sif. If you are curious about the conda environment we are using, you can check it out here.
1.4 Project Setup
We use a tool called Snakemake to automate the analysis. To simplify the creation of the required configuration files, the container includes a helper script called amplicon_project.py.
Required Configuration Files
Before running the setup script, ensure you have the following two files ready:
- virus_config.yaml: Defines parameters for the viruses you are analyzing.
- sample_map.csv: Links your barcode directories to the specific virus ID defined in the config.
Example virus_config.yaml:
sars-cov-2:
  # Paths to reference and primer files
  reference_genome: /abs/path/to/reference.fasta
  primer: /abs/path/to/primer.fasta
  primer_reference: /abs/path/to/primer_reference.fasta
  # Required analysis parameters
  min_length: 250
  coverage: 30
  primer_allowed_mismatch: 2
  # Optional workflow steps
  run_nextclade: true
  nextclade_dataset: 'nextstrain/sars-cov-2/wuhan-hu-1/orfs' # Official Nextclade-maintained dataset
measles:
  reference_genome: /path/to/reference.fasta
  primer: /path/to/primer.fasta
  primer_reference: /path/to/primer_reference.fasta
  min_length: 100
  coverage: 30
  primer_allowed_mismatch: 2
  run_nextclade: true
  nextclade_dataset: '/abs/path/to/custom/dataset' # Custom user-created dataset
mpox:
  reference_genome: /path/to/reference.fasta
  primer: /path/to/primer.fasta
  primer_reference: /path/to/primer_reference.fasta
  min_length: 1000
  coverage: 30
  primer_allowed_mismatch: 2
  run_nextclade: false # Nextclade will not run for this virus
  nextclade_dataset: null
# ... add other viruses if needed.

Key parameters within each virus entry include:
- reference_genome, primer, primer_reference: Absolute paths to the respective FASTA files.
- min_length: The minimum read length to keep after QC. Must be below the expected amplicon size.
- coverage: The minimum read depth required for consensus calling; 30x is a common minimum.
- primer_allowed_mismatch: The number of mismatches allowed when matching primers and determining their positions.
- run_nextclade: Set to true to enable the Nextclade analysis for this virus, false otherwise.
- nextclade_dataset: Path to a Nextclade dataset. This can be an official dataset name (the workflow will download it) or an absolute path to a custom dataset you have locally.
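A misspelled or omitted key is a common cause of setup failures. A quick grep-based sanity check can catch this before you run the setup script; this is a coarse sketch (it assumes the file is named virus_config.yaml as in the example above, and it looks for each key anywhere in the file rather than per virus entry).

```shell
# List any required key that never appears in the config file.
for key in reference_genome primer primer_reference min_length \
           coverage primer_allowed_mismatch run_nextclade nextclade_dataset; do
  grep -q "^[[:space:]]*${key}:" virus_config.yaml || echo "missing key: ${key}"
done
```

No output means all required keys were found at least once.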
Example sample_map.csv:
barcode_dir,virus_id
barcode01,sars-cov-2
barcode02,sars-cov-2
barcode03,measles
barcode04,measles
barcode05,mpox

- barcode_dir: Name of the barcode directory containing the raw fastq.gz files.
- virus_id: Virus name; must match one of the names in the virus config.
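Because the virus_id column must match the top-level names in virus_config.yaml exactly, a short cross-check can flag typos before the pipeline runs. This is a sketch, assuming the two file names from the examples above:

```shell
# Flag every virus_id in the sample map that has no matching
# top-level entry in the virus config.
tail -n +2 sample_map.csv | cut -d, -f2 | sort -u | while read -r vid; do
  grep -q "^${vid}:" virus_config.yaml || echo "no config entry for: ${vid}"
done
```

No output means every virus_id in the CSV has a corresponding config entry.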
Initializing the Project Directory
Use singularity exec to run the setup script. This will create your project folder and generate the sample.tsv and Snakefile required for the pipeline.
singularity exec \
--bind /mnt/viro0002-data:/mnt/viro0002-data \
--bind $HOME:$HOME \
--bind $PWD:$PWD \
naam_workflow.sif \
python /amplicon_project.py \
-p {project.folder} \
-n {name} \
-d {reads} \
--virus-config {virus_config.yaml} \
--sample-map {sample_map.csv}

Please use absolute paths for the reads, virus_config.yaml, and sample_map.csv so that they can always be located.
- {project.folder}: The new directory where results will be stored.
- {name}: Name of your study (no spaces).
- {reads}: Folder containing your barcode subdirectories (e.g., barcode01).
- {virus_config.yaml}: Path to your YAML config.
- {sample_map.csv}: Path to your CSV map.
The --bind arguments explicitly tell Singularity to mount the necessary host directories into the container. The part before the colon is the path on the host machine that you want to make available; the path after the colon is where that directory appears inside the container.
By default, Singularity often binds your home directory ($HOME) and the current directory ($PWD) automatically. We also explicitly bind /mnt/viro0002-data in this example. If your input files (reads, reference, databases) or output project directory reside outside these locations, you MUST add a specific --bind /host/path:/container/path option for each of them; otherwise the container will not be able to find them.
1.5 Executing the Pipeline
Once the project directory is initialized, navigate into it and run the workflow.
- Navigate to the project:
cd {project.folder}

This folder should contain your Snakefile and sample.tsv files, which were generated during step 1.4.
- Dry Run (Optional but Recommended): Check for errors without executing commands.
singularity exec \
--bind /mnt/viro0002-data:/mnt/viro0002-data \
--bind $HOME:$HOME \
--bind $PWD:$PWD \
naam_workflow.sif \
snakemake --snakefile Snakefile --cores 1 --dryrun

- Run the workflow: Remove --dryrun and set the number of threads.
singularity exec \
--bind /mnt/viro0002-data:/mnt/viro0002-data \
--bind $HOME:$HOME \
--bind $PWD:$PWD \
naam_workflow.sif \
snakemake --snakefile Snakefile --cores {threads}

Directory Structure: Upon completion, your project folder will contain a results/ directory with subfolders for QC, consensus, variants, and nextclade (if enabled).
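As a quick completeness check after the run, you can confirm that the expected subfolders exist. This is a sketch based on the description above; the exact folder names in your version of the workflow may differ, and nextclade/ is only present when Nextclade was enabled.

```shell
# Report any expected results subfolder that is missing.
for d in QC consensus variants; do
  [ -d "results/$d" ] || echo "missing: results/$d"
done
```

No output means all three subfolders are in place.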