Illumina metagenomic analysis manual

Author

Luc van Zon

Published

April 9, 2025

Introduction

Welcome to the Illumina metagenomic data analysis manual. This documentation covers the IMAM workflow, an Illumina metagenomic pipeline that performs quality control, filters out host sequences, assembles reads into contigs, annotates the contigs, and extracts viral contigs and their corresponding reads.

This manual is divided into two parts:

Quick Start: How to set up and run the automated pipeline immediately.
Manual Execution: A detailed breakdown of the underlying bioinformatic steps (Under the hood).

Workflow Summary

The pipeline performs the following steps:

Quality control
- Merging and decompressing: Merge and decompress the compressed FASTQ files with zcat.
- Deduplication: Remove duplicate reads from the uncompressed FASTQ files with cd-hit-dup.
- Quality trimming: Perform quality and sequence adapter trimming with fastp.
- Host filtering: Remove reads that map to a host genome (e.g., human) with bwa.
De novo assembly
- Assemble the host-filtered reads de novo into contigs with SPAdes.
Taxonomic classification
- Annotate the aggregated contigs by assigning taxonomic classifications based on sequence similarity to known proteins in a database, using diamond blastx.
Extracting Viral Sequences and Analyzing Mapped Reads
- Extract the annotated, unannotated, and viral contigs.
- Map the quality-filtered and host-filtered reads back to the assembled contigs to quantify the abundance of different contigs in each sample.
- Extract and count mapped reads for each annotation file.
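To make the first quality control step more concrete, the sketch below shows in Python what merging and decompressing with zcat amounts to: several gzip-compressed FASTQ files are decompressed and concatenated into one uncompressed file. This is an illustration only, not the pipeline's own code. The function names `merge_fastq_gz` and `count_reads` are hypothetical, and the read count assumes the common four-lines-per-record FASTQ layout.

```python
import gzip
import shutil
from pathlib import Path


def merge_fastq_gz(inputs, output):
    """Decompress gzipped FASTQ files and concatenate them into one file.

    Hypothetical helper: mimics `zcat file1.gz file2.gz > output`.
    """
    with open(output, "wb") as out:
        for path in inputs:
            # gzip.open decompresses on the fly while reading
            with gzip.open(path, "rb") as handle:
                shutil.copyfileobj(handle, out)
    return Path(output)


def count_reads(fastq):
    """Count reads, assuming four lines per FASTQ record (an assumption)."""
    with open(fastq, "rb") as handle:
        return sum(1 for _ in handle) // 4
```

Comparing the read count of the merged file against the sum of the inputs is a simple sanity check before moving on to deduplication.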