Short-Read Processing Pipeline (Illumina)
Pipeline Workflow
The flowchart below summarizes the short-read processing pipeline. VCF annotation runs only when a GFF/GTF annotation file is provided. Delly and FreeBayes run only when reads are mapped to the reference genome; both steps are skipped when reads are mapped to a modified genome or a plasmid. For SV calls, vcf_to_table_short, build_sv_flank_bed, and mosdepth add coverage metrics for 100 bp flanking regions to the TSV output.
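The flank step can be pictured with a small sketch. This is not the pipeline's build_sv_flank_bed implementation, which is not shown here; it is one plausible reading of "100 bp flank", assuming the flanks are the 100 bp immediately before the SV start and immediately after the SV end. The function name `sv_flank_intervals` and its parameters are hypothetical. Input coordinates follow the VCF convention (1-based, inclusive) and output follows the BED convention (0-based, half-open).

```python
from typing import List, Optional, Tuple

# (chrom, start, end, name) in BED coordinates: 0-based, half-open.
BedInterval = Tuple[str, int, int, str]


def sv_flank_intervals(
    chrom: str,
    pos: int,
    end: int,
    flank: int = 100,
    chrom_len: Optional[int] = None,
) -> List[BedInterval]:
    """Return BED intervals for the regions flanking an SV.

    pos and end are 1-based inclusive, as in a VCF POS field and END tag.
    Hypothetical helper: the real pipeline may define flanks differently.
    """
    if pos < 1 or end < pos:
        raise ValueError("expected 1 <= pos <= end")

    # Left flank: the `flank` bases immediately before pos, clamped at 0.
    left_start = max(0, pos - 1 - flank)
    left_end = pos - 1

    # Right flank: the `flank` bases immediately after end.
    right_start = end
    right_end = end + flank
    if chrom_len is not None:
        # Clamp to the end of the contig when its length is known.
        right_start = min(right_start, chrom_len)
        right_end = min(right_end, chrom_len)

    intervals: List[BedInterval] = []
    if left_end > left_start:
        intervals.append((chrom, left_start, left_end, "left_flank"))
    if right_end > right_start:
        intervals.append((chrom, right_start, right_end, "right_flank"))
    return intervals


if __name__ == "__main__":
    for interval in sv_flank_intervals("chr1", 1000, 2000):
        print("\t".join(str(field) for field in interval))
```

A BED file written this way could be passed to mosdepth through its `--by` option to obtain per-region coverage, which is presumably how the flank metrics reach the TSV output.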
Directory Structure
This folder contains the full output of the Illumina short-read processing pipeline, including read quality control, trimming, genome mapping, and variant analysis.
data/outputs/illumina/
├── qc_trimming
│ ├── fastqc_out
│ ├── multiqc
│ └── trimmed_reads
├── short-mod
│ ├── bam
│ ├── bwa_index
│ ├── multiqc
│ ├── picard
│ ├── samtools_stats
│ └── unmapped_fastq
├── short-ref
│ ├── bam
│ ├── bcftools_stats
│ ├── bwa_index
│ ├── multiqc
│ ├── picard
│ ├── samtools_index_dict
│ ├── samtools_stats
│ ├── unmapped_fastq
│ └── vcf
└── short-ref-plasmid
  ├── bam
  ├── bwa_index
  ├── multiqc
  ├── picard
  ├── samtools_stats
  └── unmapped_fastq
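After a run, the layout above can be checked programmatically. The following is a minimal sketch, not part of the pipeline: the `EXPECTED_LAYOUT` mapping is transcribed from the tree, and the helper name `missing_output_dirs` is hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# Subdirectories transcribed from the directory tree in this document.
EXPECTED_LAYOUT: Dict[str, List[str]] = {
    "qc_trimming": ["fastqc_out", "multiqc", "trimmed_reads"],
    "short-mod": [
        "bam", "bwa_index", "multiqc", "picard",
        "samtools_stats", "unmapped_fastq",
    ],
    "short-ref": [
        "bam", "bcftools_stats", "bwa_index", "multiqc", "picard",
        "samtools_index_dict", "samtools_stats", "unmapped_fastq", "vcf",
    ],
    "short-ref-plasmid": [
        "bam", "bwa_index", "multiqc", "picard",
        "samtools_stats", "unmapped_fastq",
    ],
}


def missing_output_dirs(
    root: Path, layout: Dict[str, List[str]] = EXPECTED_LAYOUT
) -> List[str]:
    """Return the expected subdirectories that do not exist under root."""
    missing: List[str] = []
    for parent, children in layout.items():
        for child in children:
            relative = Path(parent) / child
            if not (root / relative).is_dir():
                missing.append(relative.as_posix())
    return missing


if __name__ == "__main__":
    for name in missing_output_dirs(Path("data/outputs/illumina")):
        print(f"missing: {name}")
```

Not every missing entry is an error: short-ref-plasmid/, for example, is created only when a reference plasmid is supplied, so the report should be read against the inputs that were actually provided.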
Output Subdirectories
qc_trimming/
This directory contains all quality control and preprocessing outputs generated from raw Illumina reads.
- fastqc_out/ - Raw read quality reports (per sample) generated by FastQC.
- multiqc/ - Aggregated quality control report summarizing all FastQC results and trimming reports.
- trimmed_reads/ - Quality-filtered and adapter-trimmed reads used for downstream mapping.
short-ref/
This folder contains Illumina reads mapped to the reference genome.
Includes:
- bam/ - Sorted and indexed BAM alignment files
- bwa_index/ - Precomputed BWA reference genome indices
- samtools_index_dict/ - FASTA index and sequence dictionary files
- samtools_stats/ - Alignment and coverage statistics
- picard/ - Alignment QC metrics
- bcftools_stats/ - Variant calling summary statistics
- vcf/ - Variant calls generated by Delly (SVs) and FreeBayes (SNPs and indels), plus annotated VCFs (SNPs and indels) if GFF or GTF files are present in data/valid
- multiqc/ - Combined QC report covering mapping, alignment, and variant calling metrics
- unmapped_fastq/ - FASTQ file with reads that failed to align to the reference genome
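Whether vcf/ ends up holding annotated VCFs depends on the inputs. The sketch below illustrates that decision under stated assumptions: the recognized file suffixes, the step labels, and their order are illustrative guesses, and `find_annotation_files` and `variant_steps` are hypothetical helpers, not pipeline functions. SnpEff is assumed to be the annotation step because it is the annotation tool listed for this pipeline.

```python
from pathlib import Path
from typing import List

# Assumed suffixes; compressed files such as .gff.gz are not matched here.
ANNOTATION_SUFFIXES = (".gff", ".gff3", ".gtf")


def find_annotation_files(valid_dir: Path) -> List[Path]:
    """List GFF/GTF files found directly inside valid_dir."""
    if not valid_dir.is_dir():
        return []
    return sorted(
        path
        for path in valid_dir.iterdir()
        if path.is_file() and path.suffix.lower() in ANNOTATION_SUFFIXES
    )


def variant_steps(target: str, has_annotation: bool) -> List[str]:
    """Return the variant-analysis steps expected for a mapping target.

    Variant calling is described only for the reference genome
    ("short-ref"); modified-genome and plasmid mappings skip it.
    """
    if target != "short-ref":
        return []
    steps = ["delly", "freebayes", "bcftools_stats"]
    if has_annotation:
        steps.append("snpeff_annotation")
    return steps


if __name__ == "__main__":
    annotated = bool(find_annotation_files(Path("data/valid")))
    for target in ("short-ref", "short-mod", "short-ref-plasmid"):
        print(target, variant_steps(target, annotated))
```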
short-ref-plasmid/
This folder holds the mapping results of Illumina reads aligned to the reference plasmid FASTA. It is created only if a reference plasmid is present in the data/valid folder. A folder with a similar structure, short-mod-plasmid/, is created if a modified plasmid is present in the data/valid folder.
Includes:
- bam/ - Reads that did not map to the reference genome, aligned to the plasmid
- bwa_index/ - Plasmid reference index files
- samtools_stats/ - Mapping and coverage statistics
- picard/ - Alignment QC metrics
- multiqc/ - Summary report of mapping and alignment metrics
- unmapped_fastq/ - Reads that mapped to neither the reference genome nor the plasmid
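The relationship between the reference and plasmid outputs is a two-stage cascade: only reads left unmapped by the reference step are offered to the plasmid. The toy model below illustrates this with sets of read IDs. It is a conceptual sketch only; the real pipeline works on BAM and FASTQ files, and `partition_reads` is a hypothetical name.

```python
from typing import Dict, Set


def partition_reads(
    all_reads: Set[str],
    maps_to_reference: Set[str],
    maps_to_plasmid: Set[str],
) -> Dict[str, Set[str]]:
    """Assign each read ID to the output folder it would end up in."""
    # Stage 1: map everything to the reference genome.
    reference_mapped = all_reads & maps_to_reference
    unmapped_to_reference = all_reads - maps_to_reference

    # Stage 2: only the leftovers are aligned to the plasmid.
    plasmid_mapped = unmapped_to_reference & maps_to_plasmid
    unmapped_to_both = unmapped_to_reference - maps_to_plasmid

    return {
        "short-ref/bam": reference_mapped,
        "short-ref/unmapped_fastq": unmapped_to_reference,
        "short-ref-plasmid/bam": plasmid_mapped,
        "short-ref-plasmid/unmapped_fastq": unmapped_to_both,
    }


if __name__ == "__main__":
    result = partition_reads(
        all_reads={"r1", "r2", "r3", "r4"},
        maps_to_reference={"r1", "r2"},
        maps_to_plasmid={"r2", "r3"},
    )
    for folder, reads in result.items():
        print(folder, sorted(reads))
```

In this example r2 could align to both sequences, but it never reaches the plasmid step because the reference step already claimed it.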
short-mod/
This folder contains Illumina read alignments against the modified/assembled genome.
Includes:
- bam/ - Sorted BAM files for modified genome mapping
- bwa_index/ - Modified genome BWA index
- samtools_stats/ - Mapping and coverage statistics
- picard/ - Alignment QC metrics
- multiqc/ - Summary report of mapping and alignment metrics
- unmapped_fastq/ - FASTQ file containing reads that failed to align to the modified genome
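A quick way to see how many reads ended up in an unmapped_fastq/ folder is to count FASTQ records. The sketch below assumes standard four-line records (no wrapped sequence lines) and that files may or may not be gzip-compressed; the file naming pattern inside unmapped_fastq/ is not specified here, so the glob in the usage section is an assumption.

```python
import gzip
from pathlib import Path


def count_fastq_records(path: Path) -> int:
    """Count reads in a FASTQ file, assuming four lines per record."""
    # Pick a text-mode opener based on the file extension.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path} does not contain a multiple of 4 lines")
    return n_lines // 4


if __name__ == "__main__":
    folder = Path("data/outputs/illumina/short-mod/unmapped_fastq")
    # Assumed naming: any file whose name contains ".fastq" or ".fq".
    for fastq in sorted(folder.glob("*.f*q*")):
        print(fastq.name, count_fastq_records(fastq))
```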
Tools Used
| Tool | Link for Further Information |
|---|---|
| Trim Galore | Trim Galore |
| FastQC | FastQC |
| MultiQC | MultiQC |
| BWA | BWA |
| Picard | Picard |
| Samtools | Samtools |
| BCFtools | BCFtools |
| FreeBayes | FreeBayes |
| SnpEff | SnpEff |
| Delly | Delly |
Citation
- Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li. Twelve years of SAMtools and BCFtools. GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008
- Tobias Rausch, Thomas Zichner, Andreas Schlattl, Adrian M. Stuetz, Vladimir Benes, Jan O. Korbel. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012 Sep 15;28(18):i333-i339. https://doi.org/10.1093/bioinformatics/bts378
- Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672
- Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012
- Philip Ewels, Måns Magnusson, Sverker Lundin, Max Käller. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, Volume 32, Issue 19, October 2016, Pages 3047–3048, https://doi.org/10.1093/bioinformatics/btw354
- Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Heng Li, Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, Volume 25, Issue 14, July 2009, Pages 1754–1760, https://doi.org/10.1093/bioinformatics/btp324
- Martin, Marcel. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, v. 17, n. 1, pp. 10-12, May 2011. ISSN 2226-6089. Available at: https://journal.embnet.org/index.php/embnetjournal/article/view/200. Date accessed: 14 Dec. 2025. https://doi.org/10.14806/ej.17.1.200
See Also
- Long-Read Processing Pipeline (PacBio & Oxford Nanopore) - PacBio and ONT results
- Unmapped Statistics - Detailed unmapped read analysis
- Truvari Comparison - Variant comparison results