Long-Read Processing Pipeline (PacBio & ONT)

Pipeline Workflow

This workflow processes raw long-read sequencing data (PacBio or Nanopore) from quality control through mapping and structural variant (SV) calling. Reads first undergo NanoPlot QC and are then mapped to the reference or modified genome with minimap2. The alignments are sorted and indexed, and unmapped reads are extracted to FASTQ. Structural variant calling with cute_sv, debreak, and sniffles is performed only for reads mapped to the reference genome. The three callsets are merged with SURVIVOR and summarized with bcftools stats, producing the final long-read VCF. Reads mapped to modified or plasmid sequences skip structural variant calling. For the SV calls, vcf_to_table_long, build_sv_flank_bed, and mosdepth add coverage metrics for 100 bp flanks to the TSV output.
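The flank step can be pictured with a small example. This is a minimal sketch, not the pipeline's build_sv_flank_bed implementation: the function name `flank_intervals`, the record labels, and the choice to place each flank immediately outside the SV span are assumptions made for illustration. It converts a 1-based VCF POS/END pair into two 0-based, half-open BED intervals of 100 bp, clipping the left flank at the start of the contig.

```python
from typing import List, Tuple

FLANK_BP = 100  # flank width stated in the workflow description


def flank_intervals(chrom: str, pos: int, end: int,
                    flank: int = FLANK_BP) -> List[Tuple[str, int, int, str]]:
    """Return left and right flank BED records for one SV (hypothetical helper).

    pos and end are 1-based, inclusive VCF coordinates (POS and INFO/END).
    BED intervals are 0-based and half-open.
    """
    if pos < 1 or end < pos:
        raise ValueError("expected 1 <= pos <= end")
    sv_start0 = pos - 1  # 0-based start of the SV span
    left = (chrom, max(0, sv_start0 - flank), sv_start0, "left_flank")
    right = (chrom, end, end + flank, "right_flank")
    # Drop an empty left flank when the SV starts at the first base.
    return [rec for rec in (left, right) if rec[2] > rec[1]]


if __name__ == "__main__":
    for rec in flank_intervals("chr1", 1001, 1500):
        print("\t".join(map(str, rec)))
```

The right flank is not clipped to the contig length here because that length is not known from the VCF record alone. A BED file of this shape is the kind of region file mosdepth accepts through its `--by` option.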

%%{init: { "theme": "base", "themeVariables": { "primaryColor": "#B6ECE2", "primaryTextColor": "#160F26", "primaryBorderColor": "#065647", "lineColor": "#545555", "clusterBkg": "#BABCBD22", "clusterBorder": "#DDDEDE", "fontFamily": "arial" } }}%%
flowchart TB
    %% ===== INPUTS =====
    LONG_READS["Raw Long Reads (PacBio / Nanopore)"]
    REF["Reference FASTA"]
    PLASMID_REF["Plasmid FASTA"]
    style LONG_READS fill:#E3F2FD,stroke:#1565C0,stroke-width:2px
    style REF fill:#E3F2FD,stroke:#1565C0,stroke-width:2px
    style PLASMID_REF fill:#E3F2FD,stroke:#1565C0,stroke-width:2px

    %% ===== QC =====
    NANO_PLOT["NanoPlot QC"]
    MULTIQC_QC["MultiQC (QC)"]
    NANO_PLOT_OUT["NanoPlot QC report"]:::output
    MULTIQC_QC_OUT["MultiQC QC report"]:::output
    LONG_READS --> NANO_PLOT --> MULTIQC_QC
    NANO_PLOT --> NANO_PLOT_OUT
    MULTIQC_QC --> MULTIQC_QC_OUT

    %% ===== LONG REF MAPPING PIPELINE =====
    MINIMAP2["minimap2 mapping"]
    SORT_BAM["Samtools sort"]
    SORTED_BAM["Sorted BAM"]:::output
    BAM_IDX["Samtools index BAM"]
    BAM_IDX_OUT["BAM index"]:::output
    GET_UNMAPPED["Get unmapped reads"]
    UNMAPPED_OUT["Unmapped reads FASTQ"]:::output
    REF --> MINIMAP2
    LONG_READS --> MINIMAP2
    MINIMAP2 --> SORT_BAM --> SORTED_BAM
    SORT_BAM --> BAM_IDX --> BAM_IDX_OUT
    BAM_IDX --> GET_UNMAPPED --> UNMAPPED_OUT

    %% ===== PLASMID PIPELINE =====
    MINIMAP2_PLASMID["minimap2 mapping (plasmid)"]
    SORT_BAM_PLASMID["Samtools sort (plasmid)"]
    SORTED_BAM_PLASMID["Sorted BAM (plasmid)"]:::output
    BAM_IDX_PLASMID["Samtools index BAM (plasmid)"]
    BAM_IDX_PLASMID_OUT["BAM index (plasmid)"]:::output
    GET_UNMAPPED_PL["Get unmapped plasmid reads"]
    UNMAPPED_PL_OUT["Unmapped plasmid reads FASTQ"]:::output
    PLASMID_REF --> MINIMAP2_PLASMID
    UNMAPPED_OUT --> MINIMAP2_PLASMID
    MINIMAP2_PLASMID --> SORT_BAM_PLASMID --> BAM_IDX_PLASMID --> GET_UNMAPPED_PL
    SORT_BAM_PLASMID --> SORTED_BAM_PLASMID
    BAM_IDX_PLASMID --> BAM_IDX_PLASMID_OUT
    GET_UNMAPPED_PL --> UNMAPPED_PL_OUT

    %% ===== SV CALLING PIPELINE =====
    CUTE_SV["cute_sv"]
    DEBREAK["debreak"]
    SNIFFLES["sniffles"]
    SURVIVOR["survivor"]
    BCFTOOLS_STATS["bcftools stats"]
    VCF2TABLE_LONG["vcf_to_table_long"]
    BUILD_BED_LONG["build_sv_flank_bed"]
    MOSDEPTH_LONG["mosdepth (100 bp flanks)"]
    SV_TSV_LONG["SV TSV + flank coverage"]:::output
    LONG_VCF["Long-read VCF"]:::output
    BAM_IDX --> CUTE_SV --> SURVIVOR
    BAM_IDX --> DEBREAK --> SURVIVOR
    BAM_IDX --> SNIFFLES --> SURVIVOR
    SURVIVOR --> BCFTOOLS_STATS --> LONG_VCF
    LONG_VCF --> VCF2TABLE_LONG --> BUILD_BED_LONG --> MOSDEPTH_LONG --> SV_TSV_LONG

    %% ===== STYLING =====
    classDef input fill:#E3F2FD,stroke:#1565C0
    classDef process fill:#B6ECE2,stroke:#065647
    classDef output fill:#E8F5E9,stroke:#2E7D32

    %% ===== LEGEND =====
    subgraph LEGEND["Legend"]
        L1["Input"]:::input
        L2["Process"]:::process
        L3["Output file"]:::output
    end
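The mapping branch of the workflow (minimap2, sort, index, unmapped-read extraction) could be expressed roughly as the command strings built below. This is a sketch under stated assumptions, not the pipeline's actual rules: the function name `mapping_commands`, the output file names, the thread count, and the choice of minimap2 preset per platform are all illustrative. The function only builds the commands; it does not run them.

```python
import shlex
from typing import Dict, List

# minimap2 presets per platform. Which preset the pipeline really uses
# is an assumption (map-pb would be the preset for PacBio CLR reads).
PRESETS: Dict[str, str] = {"ont": "map-ont", "pacbio": "map-hifi"}


def mapping_commands(reads: str, ref: str, platform: str, prefix: str,
                     threads: int = 4) -> List[str]:
    """Return shell command strings for map -> sort -> index -> unmapped."""
    preset = PRESETS[platform]
    bam = f"{prefix}.sorted.bam"           # hypothetical output name
    unmapped = f"{prefix}.unmapped.fastq"  # hypothetical output name
    q = shlex.quote
    return [
        # Align and pipe the SAM stream straight into a coordinate sort.
        f"minimap2 -ax {preset} -t {threads} {q(ref)} {q(reads)}"
        f" | samtools sort -@ {threads} -o {q(bam)} -",
        # Index the sorted BAM.
        f"samtools index {q(bam)}",
        # SAM flag 4 marks unmapped reads; write them back out as FASTQ.
        f"samtools fastq -f 4 {q(bam)} > {q(unmapped)}",
    ]


if __name__ == "__main__":
    for cmd in mapping_commands("reads.fastq.gz", "ref.fa", "ont", "sample"):
        print(cmd)
```

According to the diagram, the same chain is repeated for the plasmid FASTA, with the unmapped reads from the reference mapping as input.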

Overview

The two output folders, data/outputs/ont/ and data/outputs/pacbio/, contain the complete results of the long-read analysis pipeline for:

  • PacBio reads, or
  • Oxford Nanopore Technologies (ONT) reads

Both follow the same folder structure and processing logic.

Directory Structure

data/outputs/ont/
data/outputs/pacbio/
├── long-mod
│   ├── bam
│   └── unmapped_fastq
├── long-ref
│   ├── bam
│   ├── bcftools_stats
│   ├── cutesv_out
│   ├── debreak_out
│   ├── sniffles_out
│   ├── survivor_out
│   └── unmapped_fastq
├── long-ref-plasmid
│   ├── bam
│   └── unmapped_fastq
└── nanoplot
    └── SampleName_report
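A quick way to confirm that a run produced the layout above is to check for the expected subdirectories. The following is a minimal sketch: the helper name `missing_dirs` is hypothetical, the directory names are copied from the tree, and treating long-ref-plasmid as optional (and everything else as required) is an assumption.

```python
from pathlib import Path
from typing import Dict, List

# Subdirectories copied from the tree above. long-ref-plasmid is left out
# because it is only created when a reference plasmid is supplied.
EXPECTED: Dict[str, List[str]] = {
    "long-ref": ["bam", "bcftools_stats", "cutesv_out", "debreak_out",
                 "sniffles_out", "survivor_out", "unmapped_fastq"],
    "long-mod": ["bam", "unmapped_fastq"],
    "nanoplot": [],
}


def missing_dirs(platform_root: Path) -> List[str]:
    """List expected directories that are absent under one platform folder."""
    missing: List[str] = []
    for top, subs in EXPECTED.items():
        for rel in [top] + [f"{top}/{sub}" for sub in subs]:
            if not (platform_root / rel).is_dir():
                missing.append(rel)
    return missing


if __name__ == "__main__":
    for platform in ("ont", "pacbio"):
        root = Path("data/outputs") / platform
        print(platform, missing_dirs(root))
```

An empty list means every expected directory is present for that platform.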

Output Subdirectories

long-ref/

Contains all outputs generated by mapping long reads to the reference genome.

Includes:

  • bam/ - Sorted and indexed BAM alignment files of long reads mapped to the reference genome.
  • bcftools_stats/ - Summary statistics of detected variants after variant calling.
  • cutesv_out/ - Structural variants called using cuteSV.
  • sniffles_out/ - Structural variants called using Sniffles.
  • debreak_out/ - Structural variants detected using DeBreak.
  • survivor_out/ - Merged structural variant callsets generated by SURVIVOR.
  • unmapped_fastq/ - Long reads that failed to align to the reference genome.
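The reports in bcftools_stats/ are plain text, and their "SN" (summary numbers) rows are tab-separated, so the headline counts are easy to pull out. The parser below is a sketch only: the function name `parse_summary_numbers` is hypothetical, and the sample text in the usage block is made up for illustration rather than taken from a pipeline run.

```python
from typing import Dict


def parse_summary_numbers(stats_text: str) -> Dict[str, int]:
    """Collect the SN rows of a bcftools stats report into a dict."""
    summary: Dict[str, int] = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue  # skip comment lines and all other sections
        fields = line.split("\t")
        if len(fields) < 4:
            continue
        key = fields[2].strip().rstrip(":")
        try:
            summary[key] = int(fields[3])
        except ValueError:
            continue  # ignore rows whose value is not an integer
    return summary


if __name__ == "__main__":
    # Illustrative input, not real pipeline output.
    example = (
        "# SN\t[2]id\t[3]key\t[4]value\n"
        "SN\t0\tnumber of samples:\t1\n"
        "SN\t0\tnumber of records:\t42\n"
    )
    print(parse_summary_numbers(example))
```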

long-ref-plasmid/

This folder holds the mapping results of long reads aligned to the reference plasmid sequence. It is created only if a reference plasmid is present in the data/valid folder. A folder with a similar structure, long-mod-plasmid/, is created if a modified plasmid is present within the data/valid folder.

Includes:

  • bam/ - Plasmid-mapped long-read alignments
  • unmapped_fastq/ - FASTQ file containing reads that did not map to the plasmid

long-mod/

Contains alignments of long reads mapped to the modified/assembled genome.

Includes:

  • bam/ - Sorted alignment files
  • unmapped_fastq/ - Reads that failed to align to the modified genome

This allows the mapping results against the reference genome to be compared with those against the modified assembly.
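One simple comparison is the number of reads left unmapped against each target. The sketch below counts records in a FASTQ file. The function name `count_fastq_reads` is hypothetical, and it assumes standard four-line FASTQ records and that gzip-compressed files end in `.gz`. The file names inside unmapped_fastq/ are not specified here, so the paths must be supplied by the user.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path: Path) -> int:
    """Count reads in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path} does not contain whole 4-line records")
    return n_lines // 4
```

Calling it once on the long-ref file and once on the long-mod file for the same sample gives two counts that can be compared directly. A lower count for long-mod would suggest that more reads found a placement on the modified assembly.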

nanoplot/

Contains long-read quality control and summary statistics generated using NanoPlot.

Example content:

  • SampleName_report/

Inside this folder you typically find:

  • Read length distributions
  • N50 / N90 statistics
  • Quality score profiles
  • Read length vs quality plots
  • Summary statistics of long-read sequencing quality
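For readers unfamiliar with the N50 / N90 statistics, the sketch below uses the standard definition: Nx is the read length at which the cumulative length of reads, taken from longest to shortest, first reaches x% of the total bases. The function name `nx` is hypothetical, and this is an illustration of the definition, not NanoPlot's own code.

```python
from typing import Iterable


def nx(lengths: Iterable[int], fraction: float = 0.5) -> int:
    """Return the Nx read length (N50 for fraction=0.5, N90 for 0.9)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no read lengths given")
    threshold = sum(ordered) * fraction
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]  # guard against floating-point rounding


if __name__ == "__main__":
    reads = [10, 8, 6, 4, 2]  # 30 bases in total
    print("N50:", nx(reads, 0.5))  # 10 + 8 = 18 >= 15, so N50 is 8
    print("N90:", nx(reads, 0.9))  # 10 + 8 + 6 + 4 = 28 >= 27, so N90 is 4
```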

Tools Used

The table below lists tools used within the pipeline:

Tool        Link for Further Information
samtools    samtools
BCFtools    BCFtools
cuteSV      cuteSV
DeBreak     DeBreak
Sniffles    Sniffles
SURVIVOR    SURVIVOR
NanoPlot    NanoPlot

Citation

  • Wouter De Coster, Rosa Rademakers, NanoPack2: population-scale evaluation of long-read sequencing data, Bioinformatics, Volume 39, Issue 5, May 2023, btad311, https://doi.org/10.1093/bioinformatics/btad311

  • Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li, Twelve years of SAMtools and BCFtools, GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

  • Jiang T et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol 21, 189 (2020). https://doi.org/10.1186/s13059-020-02107-y

  • Jeffares, D.C., Jolly, C., Hoti, M., Speed, D., Shaw, L., Rallis, C., Balloux, F., Dessimoz, C., Bähler, J., Sedlazeck, F.J. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat Commun 8, 14061 (2017). https://doi.org/10.1038/ncomms14061

  • Chen, Y., Wang, A.Y., Barkley, C.A. et al. Deciphering the exact breakpoints of structural variations using long sequencing reads with DeBreak. Nat Commun 14, 283 (2023). https://doi.org/10.1038/s41467-023-35996-1

  • Smolka, M., Paulin, L.F., Grochowski, C.M. et al. Detection of mosaic and population-level structural variants with Sniffles2. Nat Biotechnol 42, 1571–1580 (2024). https://doi.org/10.1038/s41587-023-02024-y

  • Sedlazeck, F.J., Rescheneder, P., Smolka, M. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 15, 461–468 (2018). https://doi.org/10.1038/s41592-018-0001-7

See Also