Generation of per structural variation (SV) type CSV tables¶
These utilities convert SV VCFs into compact TSV summaries, enrich short/long-read SV rows with flank coverage using mosdepth, and then merge available summaries into per-SV-type CSV tables.
Key points¶
- By default, a Nextflow pipeline collects the tables from the upstream pipelines and runs restructure_sv_tbl to create all summary tables.
- Variants are extracted into table format by the processes vcf_to_table and vcf_to_table_long.
- For short-read and long-read SV tables, flank coverage is added by build_sv_flank_bed and mosdepth:
  - coverage_before_100bp: mean depth in the 100 bp upstream flank
  - coverage_after_100bp: mean depth in the 100 bp downstream flank
- If one of the pipelines (short/long/assembly) was not run, the process create_empty_tbl generates an empty TSV file in its place.
- restructure_sv_tbl process: the merge step accepts any subset of (assembly, long_ont, long_pacbio, short) and ignores missing files.
- Long reads are handled as two separate sources: long_ont and long_pacbio. Output CSVs keep these in distinct long_ont_* and long_pacbio_* columns.
- Final event rows are first built by clustering records within the same chromosome and standardized SV type, then a final pass adds linked_event entries for overlapping final SV rows on the same chromosome.
- linked_event is the only relationship column in the final CSVs. It includes both same-type and cross-type overlaps.
- Final event anchoring is deterministic and size-aware: event_length_bp uses the minimum absolute svlen, and event coordinates are anchored to the same selected source call. Equal-length ties are resolved by the strongest intersection with the other source calls.
VCF Extraction and Variant Type Handling¶
The pipeline extracts variants from VCF files using different fields depending on the source:
Assembly (syri) variants:
- Variant type source: VCF ALT field (e.g., DEL, DUP, INV, INS, TRANS, INVDP, CPG, CPL, SYN, etc.)
- Extraction command: bcftools query -f '%CHROM\t%POS\t%INFO/END\t%ALT\t%ID\t%INFO/StartB\t%INFO/EndB\n'
- Columns extracted: chrom, start (POS), end (INFO/END), svtype (ALT), info_svtype (ID), start_mod (INFO/StartB), end_mod (INFO/EndB)
Short-read (delly) variants:
- Variant type source: VCF INFO/SVTYPE field (e.g., DEL, DUP, INV, INS, TRA)
- Extraction command: bcftools query -f '%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE\t%INFO/CHR2\t%POS2\t%ALT\t%INFO/SVLEN\t%INFO/PE\t%QUAL\t[%RDCN]\n'
- Key feature: Includes svlen directly from VCF for accurate insertion/translocation lengths
Long-read (cuteSV/sniffles/debreak/SURVIVOR) variants:
- Variant type source: VCF ID field or SVTYPE in INFO
- Extraction command: bcftools query -f '%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE\t%ID\t%INFO/SVLEN\t[%DR{1}]\t%QUAL\t%INFO/SUPP\n'
- Supporting reads: Extracted from FORMAT/DR (DR{1}) for long-read evidence
- Supporting methods: Populated from INFO/SUPP and stored in long_(ont|pacbio)_supporting_methods
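As a rough illustration of what happens to one row of the short-read extraction output, the sketch below splits a tab-separated line into a record. It assumes the eleven fields arrive in the order given by the short-read format string above and that bcftools prints "." for missing values. The function name parse_short_row and the dictionary keys are hypothetical and not taken from the pipeline.

```python
from typing import Optional

# Field order follows the short-read bcftools query format string.
# The key names themselves are illustrative assumptions.
SHORT_FIELDS = ["chrom", "start", "end", "svtype", "chr2", "pos2",
                "alt", "svlen", "pe", "qual", "rdcn"]
INT_FIELDS = ("start", "end", "pos2", "svlen", "pe")


def _clean(value: str) -> Optional[str]:
    """Map the bcftools missing-value marker '.' (or an empty field) to None."""
    return None if value in (".", "") else value


def parse_short_row(line: str) -> dict:
    """Split one tab-separated row into a dict keyed by SHORT_FIELDS."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(SHORT_FIELDS):
        raise ValueError(f"expected {len(SHORT_FIELDS)} fields, got {len(parts)}")
    row = {key: _clean(value) for key, value in zip(SHORT_FIELDS, parts)}
    for key in INT_FIELDS:
        if row[key] is not None:
            row[key] = int(row[key])
    return row


# Made-up example row for a deletion call.
example = "chr1\t1000\t1500\tDEL\t.\t.\t<DEL>\t-500\t12\t60\t1\n"
record = parse_short_row(example)
print(record["svtype"], record["svlen"], record["chr2"])
```

Keeping svlen signed at this stage leaves sign normalization to the later length-derivation step.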
Flank coverage with mosdepth¶
For short-read and long-read calls, the pipeline adds local depth around each SV event before final table merging:
- Processes: build_sv_flank_bed (build regions) and mosdepth (compute depth)
- Input: indexed BAM + SV TSV (chrom, start, end)
- Flanks: 100 bp upstream and 100 bp downstream of each event
- Output columns in TSV: coverage_before_100bp, coverage_after_100bp
- Assembly (asm) records are not coverage-enriched because no assembly BAM is used in this step.
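The flank regions described above can be sketched as follows. This is a minimal illustration, not the build_sv_flank_bed implementation: it assumes 0-based, half-open BED-style coordinates and clamps the upstream flank at position 0. The function name flank_intervals is hypothetical.

```python
FLANK_BP = 100  # flank size stated in the documentation


def flank_intervals(chrom, start, end, flank=FLANK_BP):
    """Return (upstream, downstream) intervals for one SV event.

    Assumption: coordinates are 0-based and half-open, as in BED files.
    The upstream start is clamped to 0; the downstream end is NOT clamped
    to the chromosome length, because that needs a chromosome-size table.
    """
    upstream = (chrom, max(0, start - flank), start)
    downstream = (chrom, end, end + flank)
    return upstream, downstream


def to_bed_lines(chrom, start, end):
    """Format both flanks as tab-separated BED lines with a label column."""
    upstream, downstream = flank_intervals(chrom, start, end)
    return [
        "\t".join(map(str, upstream)) + "\tbefore",
        "\t".join(map(str, downstream)) + "\tafter",
    ]


for line in to_bed_lines("chr1", 1000, 1500):
    print(line)
```

The mean depth that mosdepth reports for the "before" and "after" regions would then correspond to coverage_before_100bp and coverage_after_100bp.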
Variant Type Standardization¶
All extracted variant types are standardized to one of six categories in create_sv_output.py:
- DEL (Deletion)
- DUP (Duplication)
- INS (Insertion)
- INV (Inversion)
- TRA (Translocation): includes TRANS, BND, CTX, and other inter-chromosomal variants
- RPL (Replacement/Other): includes SUB, SNV, SYN, HDR, and unrecognized types
Mapping strategy:
1. Direct lookup in INFO_MAP for exact type matches
2. Token-based search for substring matches (e.g., "INVDP" contains "INV" → INV)
3. Assembly-specific prefix patterns for syri-specific types (e.g., CPG→INS, CPL→DEL, SYN→RPL)
4. All unmatched types default to RPL (replacement)
This ensures all 20+ syri variant types are correctly categorized and no variants are lost during processing.
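The four-step mapping strategy could look roughly like the sketch below. Only the name INFO_MAP and the examples (INVDP, CPG, CPL, SYN, TRANS, BND, CTX) come from this page. The exact contents of the map, the token order, and the function name standardize_svtype are assumptions made for illustration.

```python
# Step 1 table: exact matches (contents are an assumed subset).
INFO_MAP = {
    "DEL": "DEL", "DUP": "DUP", "INS": "INS", "INV": "INV",
    "TRA": "TRA", "TRANS": "TRA", "BND": "TRA", "CTX": "TRA",
    "SUB": "RPL", "SNV": "RPL", "SYN": "RPL", "HDR": "RPL",
}
# Step 2 tokens, checked in this (assumed) order.
TOKENS = ("DEL", "DUP", "INS", "INV", "TRA")
# Step 3 assembly-specific prefixes from the examples in the text.
ASM_PREFIXES = {"CPG": "INS", "CPL": "DEL", "SYN": "RPL"}


def standardize_svtype(raw):
    """Map a raw caller type to DEL, DUP, INS, INV, TRA, or RPL."""
    key = str(raw).strip().strip("<>").upper()
    # 1. Direct lookup for exact matches.
    if key in INFO_MAP:
        return INFO_MAP[key]
    # 2. Token-based substring search, e.g. "INVDP" contains "INV".
    for token in TOKENS:
        if token in key:
            return token
    # 3. Assembly-specific prefixes, e.g. "CPG" -> INS.
    for prefix, std in ASM_PREFIXES.items():
        if key.startswith(prefix):
            return std
    # 4. Everything else falls back to RPL.
    return "RPL"


for raw in ("<DEL>", "INVDP", "TRANS", "CPG", "CPL", "NOTAL"):
    print(raw, "->", standardize_svtype(raw))
```

Because the final step always returns RPL, no input type can fall outside the six categories.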
data/outputs/tables/
├── csv_per_sv_summary/
│   ├── Deletions.csv
│   ├── Duplications.csv
│   ├── Insertions.csv
│   ├── Inversions.csv
│   ├── Replacements.csv
│   └── Translocations.csv
├── supporting_reads/
│   ├── sample1_method_supporting_reads.tsv
│   └── ...
└── tsv/
    ├── assembly_sv_summary.tsv
    ├── short_sv_summary.tsv
    ├── map-pb_sv_summary.tsv
    └── map-ont_sv_summary.tsv
Supporting Reads Extraction¶
The extract_supp_reads process extracts supporting reads information from structural variant (SV) VCF files. This provides detailed evidence for each SV call, showing how many reads support the variant.
Process details¶
- Input: SV VCF file, parameter name (e.g., PE for paired-end reads), method name (e.g., delly)
- Output: TSV file in data/outputs/tables/supporting_reads/ with columns:
  - chrom: Chromosome
  - start: Start position
  - end: End position
  - svtype: SV type
  - supporting_reads: Number of supporting reads (extracted via the specified parameter)
Example output¶
This output complements the summary tables by providing granular read-level evidence for SV calls.
Example command¶
python3 modules/utils/create_sv_output.py --asm assembly_sv_summary.tsv \
--long_ont sample1_ont_sv_summary.tsv \
--long_pacbio sample1_pacbio_sv_summary.tsv \
--short sample1_short_sv_summary.tsv \
--out csv_per_sv_summary
All supported processing script options¶
| Option | Description |
|---|---|
| --asm | TSV file containing structural variant summary from assembly-based calling. Optional. |
| --long_ont | TSV file containing structural variant summary from Oxford Nanopore long-read data. Optional. |
| --long_pacbio | TSV file containing structural variant summary from PacBio long-read data. Optional. |
| --short | TSV file containing structural variant summary from short-read sequencing data. Optional. |
| --out | Output directory for the per-SV CSV files. Required. |
| --tol | Within-type clustering tolerance in base pairs. Determines whether raw SV calls get merged into the same event. Default: 10. |
| --cross_type_tol | Tolerance in base pairs for linking final events with near-identical coordinates in linked_event. Default: 0, which keeps overlap-only linking. |
Explanation of csv_per_sv_summary CSV columns¶
The final table in each CSV file contains one row per final structural variant (SV) event, with coordinates and evidence aggregated across assembly-based, long-read, and short-read pipelines.
Column prefixes
- asm_ — values reported by the assembly-based SV pipeline
- long_ont_ — values reported by the Oxford Nanopore long-read SV pipeline
- long_pacbio_ — values reported by the PacBio long-read SV pipeline
- short_ — values reported by the short-read SV pipeline
Common event-level and pipeline-derived columns¶
| Column name | Description |
|---|---|
| event_id | Unique identifier of the final structural variant event, such as DEL_1 or DUP_2. |
| chrom | Chromosome where the SV is located (VCF CHROM). |
| std_svtype | Standardized SV type harmonized across pipelines. Current values are DEL, DUP, INS, RPL, INV, and TRA. |
| event_start | Start coordinate of the selected representative call used to anchor the final event. This is taken from the same source call that determines event_length_bp (minimum absolute svlen). |
| event_end | End coordinate of the selected representative call used to anchor the final event. This is taken from the same source call that determines event_length_bp (minimum absolute svlen). |
| event_length_bp | Representative event size in base pairs, computed as the minimum available absolute source length (min(abs(svlen))) across assembly, long ONT, long PacBio, and short source representatives. If no source provides svlen, this field is NaN and coordinates use fallback cluster logic. |
| support_score | Number of input sources contributing to the final event row. In the current implementation this is the count of non-empty calls among asm, long_ont, long_pacbio, and short. |
| percentage_overlap | Comma-separated overlap percentages collected during same-type event clustering. Each value is calculated during one clustering merge step as (intersection length / longer interval length) × 100. This field is empty when the final event was built from a single record only. |
| linked_event | Semicolon-separated list of overlapping final SV events on the same chromosome. This single column includes both same-type and cross-type links. Each linked entry has the format <event_id> (<std_svtype>, <chrom>:<start>-<end>, <relation>). Standard relation values are exact_coordinates, overlap, nested_in, and contains, always from the point of view of the current row. If --cross_type_tol is set above 0, near-identical boundaries may also be reported as same_coordinates_within_<N>bp. Empty when no linked events are found. |
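The percentage_overlap formula from the table, (intersection length / longer interval length) × 100, can be written out as a small function. This is a sketch under the assumption that interval length is end minus start. The function name and the handling of zero-length intervals are illustrative choices, not confirmed behavior of create_sv_output.py.

```python
def percentage_overlap(a_start, a_end, b_start, b_end):
    """Overlap of two intervals as a percentage of the longer one."""
    intersection = max(0, min(a_end, b_end) - max(a_start, b_start))
    longer = max(a_end - a_start, b_end - b_start)
    if longer <= 0:
        # Assumption: two point-like records yield 0 rather than an error.
        return 0.0
    return 100.0 * intersection / longer


# Two 100 bp intervals sharing 50 bp.
print(percentage_overlap(100, 200, 150, 250))
# A 100 bp interval fully inside a 400 bp interval.
print(percentage_overlap(100, 500, 200, 300))
```

Dividing by the longer interval means a small call nested in a much larger one gets a low percentage even though it is fully covered.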
Possible values in linked_event¶
The examples below use simplified coordinates for clarity.
| Example current event | Example linked event entry | Meaning |
|---|---|---|
| DEL_2 at chr1:23-67 | RPL_1 (RPL, chr1:23-67, exact_coordinates) | The linked event has exactly the same coordinates as the current event. |
| DEL_2 at chr1:23-67 | INV_1 (INV, chr1:10-90, nested_in) | The current event is fully inside the linked event interval. |
| RPL_1 at chr1:10-90 | DEL_2 (DEL, chr1:23-67, contains) | The current event fully contains the linked event interval. |
| DEL_2 at chr1:23-67 | DEL_3 (DEL, chr1:60-100, overlap) | The two events partially overlap, but neither fully contains the other. |
| DEL_2 at chr1:23-67 with --cross_type_tol 5 | RPL_2 (RPL, chr1:25-69, same_coordinates_within_5bp) | The coordinates are not identical, but the start and end coordinates are both within the specified tolerance. |
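The relation values in the table can be reproduced with a short classifier. The sketch below is one possible reading, not the script's actual code: it assumes closed intervals and checks the tolerance rule before the overlap rules, which is what the last table row implies. The function names classify_relation and format_link are hypothetical.

```python
def classify_relation(cur_start, cur_end, other_start, other_end, cross_type_tol=0):
    """Relation of the linked event, seen from the current row (or None)."""
    if (cur_start, cur_end) == (other_start, other_end):
        return "exact_coordinates"
    # Assumed precedence: near-identical boundaries win over plain overlap.
    if (cross_type_tol > 0
            and abs(cur_start - other_start) <= cross_type_tol
            and abs(cur_end - other_end) <= cross_type_tol):
        return f"same_coordinates_within_{cross_type_tol}bp"
    if cur_end < other_start or other_end < cur_start:
        return None  # no overlap, so no link is recorded
    if other_start <= cur_start and cur_end <= other_end:
        return "nested_in"
    if cur_start <= other_start and other_end <= cur_end:
        return "contains"
    return "overlap"


def format_link(event_id, std_svtype, chrom, start, end, relation):
    """Render one entry in the documented linked_event format."""
    return f"{event_id} ({std_svtype}, {chrom}:{start}-{end}, {relation})"


# Current event DEL_2 at chr1:23-67 against INV_1 at chr1:10-90.
relation = classify_relation(23, 67, 10, 90)
print(format_link("INV_1", "INV", "chr1", 10, 90, relation))
```

Several entries for the same row would then be joined with semicolons to form the final linked_event value.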
Additional pipeline-specific columns¶
| Column name | Description |
|---|---|
| long_(ont\|pacbio)_supporting_reads | Number of Oxford Nanopore or PacBio reads supporting the structural variant (VCF FORMAT field DR, when present). |
| long_(ont\|pacbio)_supporting_methods | Number or label of long-read variant calling methods supporting the structural variant, derived from the TSV summary when available. |
| long_(ont\|pacbio)_coverage_before_100bp | Mean depth in the 100 bp flank before the long-read SV event, computed by mosdepth. |
| long_(ont\|pacbio)_coverage_after_100bp | Mean depth in the 100 bp flank after the long-read SV event, computed by mosdepth. |
| short_chr2 | Partner chromosome for short-read translocation/breakend calls (from short-read TSV chr2, extracted from VCF INFO/CHR2). Empty for non-translocation short-read events or when unavailable. |
| short_pos2 | Partner breakpoint position for short-read translocation/breakend calls (from short-read TSV pos2, extracted from VCF INFO/POS2). Empty for non-translocation short-read events or when unavailable. |
| short_reads_copy_number_estimate | Estimated copy number derived from short-read depth information (VCF FORMAT field RDCN). |
| short_coverage_before_100bp | Mean depth in the 100 bp flank before the short-read SV event, computed by mosdepth. |
| short_coverage_after_100bp | Mean depth in the 100 bp flank after the short-read SV event, computed by mosdepth. |
Source-specific length columns and calculation strategy¶
Structural Variant Type Conventions by Data Source¶
Short-read variants (delly):
- DEL, DUP, INV: reported as real intervals with start < end
- INS: reported as a single position (start == end) representing the insertion point
- TRA: reported as a breakpoint (start == end) representing the breakpoint position
- For INS and TRA, the inserted/translocated sequence length is taken from the svlen field rather than derived from the coordinate difference
Assembly variants (syri):
- Syri uses the VCF ALT field for variant types: DEL, INS, INV, DUP, TRANS (translocation), CPG (copy gain), CPL (copy loss), SYN (syntenic), and alignment/inverted variants
- These are mapped to standardized types: DEL, INS, INV, DUP, TRA, RPL (replacements)
- Coordinates extracted as real intervals from VCF POS and INFO/END fields
- Breakpoint information available in INFO/StartB and INFO/EndB (stored as asm_start_mod and asm_end_mod)
Long-read variants (cuteSV, sniffles, debreak, SURVIVOR merged):
- Reported with SVTYPE in INFO field
- Handled similarly to short-read variants with clustering and best-record selection
Length Derivation Logic¶
The create_sv_output.py script handles svlen consistently across all sources:
- If svlen is provided in the input TSV/VCF: use it directly as source length
- If svlen is missing:
  - For interval variants (DEL, DUP, INV, RPL): derive svlen = end - start
  - For TRA: set svlen = 0 (breakpoint semantics)
  - For INS: keep as missing (None/NaN) unless explicitly provided by caller
  - For unknown types: attempt coordinate-based derivation, fallback to None
- If svlen is present but signed or inconsistent (interval variants):
  - Normalize sign first: svlen = abs(svlen)
  - For DEL, DUP, INV, RPL, normalize to coordinate interval length: svlen = end - start
This ensures accurate length reporting regardless of variant type and source pipeline.
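The rules above can be condensed into one function. This is a minimal sketch of the described logic rather than the code in create_sv_output.py. The name derive_svlen, the helper names, and the choice to leave a provided INS/TRA svlen untouched are assumptions.

```python
import math

INTERVAL_TYPES = {"DEL", "DUP", "INV", "RPL"}


def _is_missing(value):
    """True for None and NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def _span(start, end):
    """Coordinate-based length, or None when a coordinate is unavailable."""
    if _is_missing(start) or _is_missing(end):
        return None
    return end - start


def derive_svlen(svtype, start, end, svlen=None):
    """Return the source length for one record following the listed rules."""
    if _is_missing(svlen):
        if svtype == "TRA":
            return 0      # breakpoint semantics
        if svtype == "INS":
            return None   # stays missing unless the caller provides it
        # Interval types and unknown types: derive from coordinates,
        # falling back to None when that is not possible.
        return _span(start, end)
    if svtype in INTERVAL_TYPES:
        svlen = abs(svlen)  # normalize the sign first
        span = _span(start, end)
        if span is not None and svlen != span:
            svlen = span    # normalize to the coordinate interval length
    return svlen


print(derive_svlen("DEL", 100, 600, -500))
print(derive_svlen("INS", 100, 100, 250))
print(derive_svlen("TRA", 100, 100))
```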
Event Length Calculation for Final Rows¶
In the final CSV tables:
- event_length_bp is computed as the minimum valid absolute svlen (min(abs(svlen))) across all source records in the clustered event
- If no sources provide svlen, event_length_bp is NaN
- This conservative approach ensures reported lengths represent the smallest source size estimate while handling signed caller conventions robustly
Each source pipeline registers its own length in the table:
- asm_length: Assembly pipeline svlen
- long_ont_length: Oxford Nanopore long-read svlen
- long_pacbio_length: PacBio long-read svlen
- short_length: Short-read svlen
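A compact way to express the event_length_bp rule is shown below. The column names are those listed above. The function name and the dictionary-based input are illustrative assumptions, not the script's actual interface.

```python
import math


def event_length_bp(source_lengths):
    """Minimum absolute svlen across sources, or NaN if none is usable.

    source_lengths maps a length column name (e.g. "asm_length") to the
    svlen reported by that source, which may be signed, None, or NaN.
    """
    usable = [
        abs(value)
        for value in source_lengths.values()
        if value is not None and not (isinstance(value, float) and math.isnan(value))
    ]
    return min(usable) if usable else math.nan


lengths = {
    "asm_length": -500,        # signed caller convention
    "long_ont_length": 510,
    "long_pacbio_length": None,
    "short_length": 480,
}
print(event_length_bp(lengths))
```

Taking the absolute value before the minimum prevents a negative deletion length from always winning the comparison.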
Assembly coordinates for translocations (asm_start_mod, asm_end_mod)¶
Two additional assembly-specific columns appear only in Translocations.csv:
- asm_start_mod — Start position of the translocation event in the modified (non-reference) genome
- asm_end_mod — End position of the translocation event in the modified (non-reference) genome
These columns are automatically removed from all non-translocation tables (Insertions, Deletions, Duplications, Replacements, Inversions) to maintain table clarity and avoid sparse empty columns.
Rationale: Translocations require two coordinate pairs to describe both breakpoint locations. The primary coordinates (event_start, event_end) mark the position in the reference genome (origin breakpoint), while these modifier coordinates mark the same event's position in the modified genome (destination breakpoint).
Event coordinate computation workflow¶
The create_sv_output.py script processes SV records through the following steps:
- Load and standardize records from all available source pipelines (assembly, long-read ONT/PacBio, short-read)
- Cluster records by (chromosome, standardized SV type) using interval overlap with a tolerance window (--tol, default 10 bp). Records are considered part of the same event if:
  - They share the same chromosome and standardized SV type
  - Their intervals overlap (accounting for tolerance)
  - At least one of the breakpoints (start or end) is within tolerance between members
- Select best representative per source within each cluster using a ranking strategy:
  - Rank 1: Supporting reads / evidence count (higher is better)
  - Rank 2: Quality score (higher is better)
  - Rank 3: Absolute SV size (abs(svlen)), with smaller values preferred as tie-breaker

  This ensures the highest-confidence call from each source is carried forward.
- Build source length candidates from selected source representatives:
  - asm_length, long_ont_length, long_pacbio_length, short_length from source svlen
  - Event-level comparison uses absolute lengths (abs(svlen)) to normalize caller sign conventions
- Select event anchor and event length:
  - Set event_length_bp = min(abs(svlen)) across available source representatives
  - Set event_start and event_end to the coordinates of the same selected source call
  - If multiple sources share the same minimum absolute length, choose the one with the largest total interval intersection against other source representatives
  - If no usable source svlen exists, keep event_length_bp = NaN and use type-aware fallback coordinates
- Assemble final row with all source-specific fields, filtering unnecessary columns (e.g., removing asm_start_mod/asm_end_mod from deletions, removing internal type fields)
- Final pass: link overlapping events by scanning all final rows on the same chromosome and recording any coordinate overlaps or near-overlaps (if --cross_type_tol is set)
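The clustering test, representative ranking, and anchor selection from the workflow above can be sketched as three small functions. All names and record keys here are hypothetical, and the sketch only covers the pairwise membership test: how pairs are grouped into clusters, and the type-aware fallback coordinates, are not specified on this page and are left out.

```python
import math


def same_event(a, b, tol=10):
    """Pairwise check of the three clustering conditions."""
    if (a["chrom"], a["std_svtype"]) != (b["chrom"], b["std_svtype"]):
        return False
    overlaps = a["start"] <= b["end"] + tol and b["start"] <= a["end"] + tol
    breakpoint_close = (abs(a["start"] - b["start"]) <= tol
                        or abs(a["end"] - b["end"]) <= tol)
    return overlaps and breakpoint_close


def pick_representative(records):
    """Most supporting reads, then highest quality, then smallest abs(svlen)."""
    def rank(record):
        size = abs(record["svlen"]) if record.get("svlen") is not None else math.inf
        return (-(record.get("supporting_reads") or 0),
                -(record.get("qual") or 0.0),
                size)
    return min(records, key=rank)


def _intersection(a, b):
    return max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))


def select_anchor(reps):
    """reps maps source name -> representative; returns anchor fields or None."""
    usable = {s: r for s, r in reps.items() if r.get("svlen") is not None}
    if not usable:
        return None  # caller would apply the type-aware fallback instead
    shortest = min(abs(r["svlen"]) for r in usable.values())
    tied = [s for s, r in usable.items() if abs(r["svlen"]) == shortest]
    # Tie-break: largest total intersection with the other representatives.
    best = max(tied, key=lambda s: sum(
        _intersection(reps[s], other) for t, other in reps.items() if t != s))
    return {"event_start": reps[best]["start"], "event_end": reps[best]["end"],
            "event_length_bp": shortest, "anchor_source": best}


reps = {
    "asm": {"start": 100, "end": 600, "svlen": 500},
    "long_ont": {"start": 120, "end": 620, "svlen": -500},
    "short": {"start": 90, "end": 610, "svlen": 520},
}
print(select_anchor(reps))
```

In this example asm and long_ont tie at 500 bp, and asm is chosen because its total intersection with the other two calls is larger.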