Short-Read Processing Pipeline (Illumina)

Pipeline Workflow

The flowchart below summarizes the short-read processing pipeline. VCF annotation runs only when a GFF/GTF annotation file is provided. Delly and FreeBayes run only when reads are mapped to the reference genome; both steps are skipped when reads are mapped to a modified genome or a plasmid. For SV calls, vcf_to_table_short, build_sv_flank_bed, and mosdepth add coverage metrics for 100 bp flanks to the TSV output.

%%{init: { "theme": "base", "themeVariables": { "primaryColor": "#B6ECE2", "primaryTextColor": "#160F26", "primaryBorderColor": "#065647", "lineColor": "#545555", "clusterBkg": "#BABCBD22", "clusterBorder": "#DDDEDE", "fontFamily": "arial" } }}%%
flowchart TB
    %% ===== INPUTS =====
    RAW["Raw Illumina Reads"]
    REF["Reference FASTA"]
    PLASMID_REF["Plasmid FASTA"]
    GFF["GFF / GTF Annotation"]
    CONFIG["SnpEff Config"]
    style RAW fill:#E3F2FD,stroke:#1565C0,stroke-width:2px
    style REF fill:#E3F2FD,stroke:#1565C0,stroke-width:2px
    style PLASMID_REF fill:#E3F2FD,stroke:#1565C0,stroke-width:2px
    style GFF fill:#E3F2FD,stroke:#1565C0,stroke-width:2px
    style CONFIG fill:#E3F2FD,stroke:#1565C0,stroke-width:2px

    %% ===== QC =====
    TRIM["TrimGalore"]
    FASTQC["FastQC"]
    MULTIQC_QC["MultiQC (QC)"]
    RAW --> TRIM --> FASTQC --> MULTIQC_QC

    %% ===== QC OUTPUTS =====
    TRIMMED["Trimmed FASTQ"]:::output
    FASTQC_OUT["FastQC report"]:::output
    MULTIQC_QC_OUT["MultiQC QC report"]:::output
    TRIM --> TRIMMED
    FASTQC --> FASTQC_OUT
    MULTIQC_QC --> MULTIQC_QC_OUT

    %% ===== SHORT REF PIPELINE =====
    BWA_INDEX["BWA index"]
    BWA["BWA mapping"]
    SORT["Samtools sort"]
    STATS["Samtools stats"]
    BAM_IDX["BAM index"]
    PICARD["Picard metrics"]
    FREEBAYES["Freebayes variant calling"]
    BCFTOOLS["BCFtools stats"]
    BUILD_CFG["Build snpEff config"]
    SNPEFF["snpEff annotation"]
    GET_UNMAPPED["Get unmapped reads"]
    REF --> BWA_INDEX --> BWA
    TRIMMED --> BWA
    BWA --> SORT --> SORTED_BAM["Sorted BAM"]:::output
    SORT --> STATS --> STATS_OUT["Samtools stats"]:::output
    SORT --> BAM_IDX --> BAM_INDEX_OUT["BAM index"]:::output

    %% ===== SHORT REF SV PIPELINE (Delly) =====
    DELLY["Delly (SV calling)"]
    BCF2VCF["Convert BCF to VCF"]
    VCF2TABLE_SHORT["vcf_to_table_short"]
    BUILD_BED_SHORT["build_sv_flank_bed"]
    MOSDEPTH_SHORT["mosdepth (100 bp flanks)"]
    SV_TSV_SHORT["SV TSV + flank coverage"]:::output
    BAM_IDX --> DELLY --> BCF2VCF --> SV_VCF["SV VCF"]:::output
    SV_VCF --> VCF2TABLE_SHORT --> BUILD_BED_SHORT --> MOSDEPTH_SHORT --> SV_TSV_SHORT
    BAM_IDX --> GET_UNMAPPED --> UNMAPPED_OUT["Unmapped reads FASTQ"]:::output
    BAM_IDX --> FREEBAYES --> VCF_RAW["Raw VCF"]:::output
    VCF_RAW --> BCFTOOLS --> BCF_STATS["BCFtools stats"]:::output
    GFF --> BUILD_CFG --> SNPEFF
    CONFIG --> BUILD_CFG
    VCF_RAW --> SNPEFF --> VCF_ANNOT["Annotated VCF"]:::output
    BAM_IDX --> PICARD --> PICARD_OUT["Picard metrics"]:::output

    %% ===== PLASMID PIPELINE =====
    PL_INDEX["BWA index"]
    PL_BWA["BWA mapping"]
    PL_SORT["Samtools sort"]
    PL_STATS["Samtools stats"]
    PL_BAM_IDX["BAM index"]
    PL_PICARD["Picard metrics"]
    GET_UNMAPPED_PL["Get unmapped reads plasmid"]
    PLASMID_REF --> PL_INDEX --> PL_BWA
    UNMAPPED_OUT --> PL_BWA
    PL_BWA --> PL_SORT --> PL_SORTED_BAM["Sorted BAM (plasmid)"]:::output
    PL_SORT --> PL_STATS --> PL_STATS_OUT["Samtools stats (plasmid)"]:::output
    PL_SORT --> PL_BAM_IDX --> PL_BAM_INDEX_OUT["BAM index"]:::output
    PL_BAM_IDX --> PL_PICARD --> PL_PICARD_OUT["Picard metrics (plasmid)"]:::output
    PL_BAM_IDX --> GET_UNMAPPED_PL --> PL_UNMAPPED_OUT["Unmapped reads FASTQ (plasmid)"]:::output

    %% ===== STYLING =====
    classDef input fill:#E3F2FD,stroke:#1565C0
    classDef process fill:#B6ECE2,stroke:#065647
    classDef output fill:#E8F5E9,stroke:#2E7D32
    class TRIMMED,FASTQC_OUT,MULTIQC_QC_OUT,SORTED_BAM,STATS_OUT,BAM_INDEX_OUT,PICARD_OUT,MULTIQC_MAP_OUT,UNMAPPED_OUT,VCF_RAW,BCF_STATS,VCF_ANNOT,SV_VCF,PL_SORTED_BAM,PL_STATS_OUT,PL_PICARD_OUT,PL_MULTIQC_OUT,PL_UNMAPPED_OUT output

    %% ===== LEGEND =====
    subgraph LEGEND["Legend"]
        L1["Input"]:::input
        L2["Process"]:::process
        L3["Output file"]:::output
    end
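
The conditional rules described for the workflow can be sketched in Python. This is a minimal illustration and not the pipeline's actual implementation: the function name `plan_steps`, the target labels, and the step identifiers are hypothetical names chosen to mirror the flowchart nodes.

```python
from typing import List

# Hypothetical labels for the three mapping targets described in the text.
VALID_TARGETS = {"reference", "modified", "plasmid"}


def plan_steps(target: str, has_annotation: bool) -> List[str]:
    """Return the steps that would run for one mapping target.

    Assumption: step names are illustrative and only mirror the flowchart.
    """
    if target not in VALID_TARGETS:
        raise ValueError(f"unknown mapping target: {target!r}")

    # Mapping steps shared by every target.
    steps = [
        "bwa_index",
        "bwa_mapping",
        "samtools_sort",
        "samtools_stats",
        "bam_index",
        "picard_metrics",
        "get_unmapped_reads",
    ]

    # Delly and FreeBayes run only for reference genome mapping.
    if target == "reference":
        steps += [
            "delly",
            "bcf_to_vcf",
            "vcf_to_table_short",
            "build_sv_flank_bed",
            "mosdepth",
            "freebayes",
            "bcftools_stats",
        ]
        # VCF annotation needs a GFF/GTF annotation file.
        if has_annotation:
            steps += ["build_snpeff_config", "snpeff_annotation"]

    return steps


if __name__ == "__main__":
    for name in plan_steps("reference", has_annotation=True):
        print(name)
```

Under these assumptions, calling `plan_steps("plasmid", has_annotation=True)` returns only the shared mapping steps, because variant calling and annotation are tied to the reference genome.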

Directory Structure

This folder contains the full output of the Illumina short-read processing pipeline, including read quality control, trimming, genome mapping, and variant analysis.

data/outputs/illumina/
├── qc_trimming
│   ├── fastqc_out
│   ├── multiqc
│   └── trimmed_reads
├── short-mod
│   ├── bam
│   ├── bwa_index
│   ├── multiqc
│   ├── picard
│   ├── samtools_stats
│   └── unmapped_fastq
├── short-ref
│   ├── bam
│   ├── bcftools_stats
│   ├── bwa_index
│   ├── multiqc
│   ├── picard
│   ├── samtools_index_dict
│   ├── samtools_stats
│   ├── unmapped_fastq
│   └── vcf
└── short-ref-plasmid
    ├── bam
    ├── bwa_index
    ├── multiqc
    ├── picard
    ├── samtools_stats
    └── unmapped_fastq
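
A quick way to confirm that a run produced this layout is to compare the tree above against the file system. The sketch below is one possible check; `EXPECTED_LAYOUT` and `missing_dirs` are hypothetical names, and treating all four top-level folders as required is an assumption, since some folders are created only when the matching input is present.

```python
from pathlib import Path
from typing import Dict, List, Optional

# Subdirectories copied from the tree above.
EXPECTED_LAYOUT: Dict[str, List[str]] = {
    "qc_trimming": ["fastqc_out", "multiqc", "trimmed_reads"],
    "short-mod": ["bam", "bwa_index", "multiqc", "picard",
                  "samtools_stats", "unmapped_fastq"],
    "short-ref": ["bam", "bcftools_stats", "bwa_index", "multiqc", "picard",
                  "samtools_index_dict", "samtools_stats", "unmapped_fastq",
                  "vcf"],
    "short-ref-plasmid": ["bam", "bwa_index", "multiqc", "picard",
                          "samtools_stats", "unmapped_fastq"],
}


def missing_dirs(root: Path, sections: Optional[List[str]] = None) -> List[str]:
    """List expected subdirectories that do not exist under root.

    sections restricts the check to some top-level folders; by default
    every folder in EXPECTED_LAYOUT is checked.
    """
    names = sections if sections is not None else list(EXPECTED_LAYOUT)
    missing = []
    for parent in names:
        for child in EXPECTED_LAYOUT[parent]:
            if not (root / parent / child).is_dir():
                missing.append(f"{parent}/{child}")
    return missing


if __name__ == "__main__":
    for rel in missing_dirs(Path("data/outputs/illumina")):
        print("missing:", rel)
```

Passing `sections=["qc_trimming", "short-ref"]` limits the check to folders that do not depend on optional inputs such as a plasmid.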

Output Subdirectories

`qc_trimming/`

This directory contains all quality control and preprocessing outputs generated from raw Illumina reads.

fastqc_out/ - Raw read quality reports (per-sample) generated by FastQC.
multiqc/ - Aggregated quality control report summarizing all FastQC results and trimming reports.
trimmed_reads/ - Quality-filtered and adapter-trimmed reads used for downstream mapping.

`short-ref/`

This folder contains Illumina reads mapped to the reference genome.

Includes:

bam/ - Sorted and indexed BAM alignment files
bwa_index/ - Precomputed BWA reference genome indices
samtools_index_dict/ - FASTA index and sequence dictionary files
samtools_stats/ - Alignment and coverage statistics
picard/ - Alignment QC metrics
bcftools_stats/ - Variant calling summary statistics
vcf/ - Variant calls from Delly (SVs) and FreeBayes (SNPs and indels), plus the annotated VCF (SNPs and indels) when GFF or GTF files are present in data/valid
multiqc/ - Combined QC report covering mapping, alignment, and variant-calling metrics
unmapped_fastq/ - FASTQ file with reads that failed to align to the reference genome
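
The SV flank coverage step from the workflow needs a BED interval on each side of every SV call. The source does not describe how build_sv_flank_bed works internally, so the following is only a sketch of one way 100 bp flanks could be derived. The function name `flank_intervals`, its parameters, and the assumption that the SV is already given in 0-based, half-open BED coordinates are all hypothetical.

```python
from typing import List, Tuple

FLANK_BP = 100  # flank width stated in the workflow description

BedInterval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def flank_intervals(chrom: str, start: int, end: int, chrom_len: int,
                    flank: int = FLANK_BP) -> List[BedInterval]:
    """Return the left and right flanks of an SV as BED intervals.

    Assumption: start and end already use BED coordinates. Flanks are
    clipped to the contig bounds, and empty flanks are dropped.
    """
    left = (chrom, max(0, start - flank), start)
    right = (chrom, end, min(chrom_len, end + flank))
    return [iv for iv in (left, right) if iv[2] > iv[1]]


if __name__ == "__main__":
    for c, s, e in flank_intervals("chr1", 1000, 2500, chrom_len=10_000):
        print(f"{c}\t{s}\t{e}")
```

Clipping matters for SVs near contig ends, which are common on short plasmid or bacterial contigs: a flank that would extend past position 0 or past the contig length is shortened instead of producing an invalid interval.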

`short-ref-plasmid/`

This folder holds the results of mapping Illumina reads to the reference plasmid FASTA. It is created only if a reference plasmid is present in the data/valid folder. A folder with a similar structure, short-mod-plasmid/, is created if a modified plasmid is present in the data/valid folder.

Includes:

bam/ - Plasmid alignments of the reads that did not map to the reference genome
bwa_index/ - Plasmid reference index files
samtools_stats/ - Mapping and coverage statistics
picard/ - Alignment QC metrics
multiqc/ - Summary report of mapping and alignment metrics
unmapped_fastq/ - Reads not mapped to the plasmid and not mapped to the reference genome
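
The relationship between the bam/ and unmapped_fastq/ folders can be illustrated with a toy model that treats reads as sets of IDs. This is not how the pipeline is implemented (it works on BAM and FASTQ files); `route_reads` and the dictionary keys are hypothetical names used only to show which reads end up where.

```python
from typing import Dict, Set


def route_reads(all_reads: Set[str],
                mapped_to_reference: Set[str],
                mapped_to_plasmid: Set[str]) -> Dict[str, Set[str]]:
    """Toy model of the two-stage mapping described above."""
    # Only reads that failed to map to the reference reach the plasmid step.
    plasmid_input = all_reads - mapped_to_reference
    return {
        # Reads aligned to the plasmid (contents of bam/).
        "plasmid_bam": plasmid_input & mapped_to_plasmid,
        # Reads mapped to neither target (contents of unmapped_fastq/).
        "unmapped_fastq": plasmid_input - mapped_to_plasmid,
    }


if __name__ == "__main__":
    routed = route_reads({"r1", "r2", "r3", "r4"}, {"r1", "r2"}, {"r3"})
    print(sorted(routed["plasmid_bam"]), sorted(routed["unmapped_fastq"]))
```

In this model a read that could align to both the reference and the plasmid never appears in the plasmid BAM, because it is consumed by the reference mapping step first.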

`short-mod/`

This folder contains Illumina read alignments against the modified/assembled genome.

Includes:

bam/ - Sorted BAM files for modified genome mapping
bwa_index/ - Modified genome BWA index
samtools_stats/ - Mapping and coverage statistics
picard/ - Alignment QC metrics
multiqc/ - Summary report of mapping and alignment metrics
unmapped_fastq/ - FASTQ file containing reads that failed to align to the modified genome

Tools Used

Tool	Link for Further Information
Trim Galore	Trim Galore
FastQC	FastQC
MultiQC	MultiQC
BWA	BWA
Picard	Picard
Samtools	Samtools
BCFtools	BCFtools
FreeBayes	FreeBayes
SnpEff	SnpEff
Delly	Delly

Citation

Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li, Twelve years of SAMtools and BCFtools, GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008
Tobias Rausch, Thomas Zichner, Andreas Schlattl, Adrian M. Stuetz, Vladimir Benes, Jan O. Korbel. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012 Sep 15;28(18):i333-i339. https://doi.org/10.1093/bioinformatics/bts378
Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672
Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012
Philip Ewels, Måns Magnusson, Sverker Lundin, Max Käller, MultiQC: summarize analysis results for multiple tools and samples in a single report, Bioinformatics, Volume 32, Issue 19, October 2016, Pages 3047–3048, https://doi.org/10.1093/bioinformatics/btw354
Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Heng Li, Richard Durbin, Fast and accurate short read alignment with Burrows–Wheeler transform, Bioinformatics, Volume 25, Issue 14, July 2009, Pages 1754–1760, https://doi.org/10.1093/bioinformatics/btp324
Martin, Marcel. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, v. 17, n. 1, pp. 10-12, May 2011. ISSN 2226-6089. Available at: https://journal.embnet.org/index.php/embnetjournal/article/view/200. Date accessed: 14 Dec. 2025. https://doi.org/10.14806/ej.17.1.200
