Input Validation Overview¶
Input Scenarios and Preprocessing Logic¶
The validation module not only verifies input formats, but also determines how genome assemblies are interpreted and preprocessed before entering the pipeline.
Depending on the structure of ref.fa and mod.fa, different strategies are applied for:
- chromosome and plasmid separation
- contig handling
- usage of minimap2 for sequence mapping
- preparation of files in data/valid/
The following table summarizes all supported scenarios:
| # | Scenario | Mode (config.json) | Input Structure | Plasmids Handling | run_ref_x_mod | minimap2 Mapping | mod.fa Processing | Modules Run | Output in data/valid/ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Single contig + plasmids | prokaryote | ref.fa: 1 sequence (+ optional plasmids); mod.fa: 1 sequence (+ optional plasmids) | In ref.fa: longest sequence → chromosome; remaining → plasmids (*ref_plasmid.fasta). In mod.fa: plasmids = sequences not mapped to reference | True | Used to identify unmapped regions (plasmids in mod.fa) | Reduced to 1 contig (chromosome only) | All modules | ref.fa, mod.fa (1 contig), *_contig_0.fasta, *_plasmid.fasta |
| 2 | Fragmented assembly (below limit) | prokaryote | ref.fa: 1 sequence; mod.fa: multiple sequences (≤ limit) | In ref.fa: longest sequence → chromosome; remaining → plasmids (*ref_plasmid.fasta). In mod.fa: unmapped sequences → plasmids; mapped sequences → contigs | True | Used to split mod.fa into mapped contigs vs unmapped plasmids | Split into individual contigs (*_contig.fasta); mod.fa becomes multifasta without plasmids | All modules | ref.fa, mod.fa + contig set, *_contig.fasta, *_plasmid.fasta |
| 3 | Fragmented assembly (above limit) | prokaryote | ref.fa: 1 sequence; mod.fa: multiple sequences (> limit) | In ref.fa: longest sequence → chromosome; remaining → plasmids (*ref_plasmid.fasta). In mod.fa: no plasmid detection | False | Not used | No processing (mod.fa copied as-is) | Mapping-only modules | ref.fa, mod.fa (copied), *_plasmid.fasta |
| 4 | Multiple sequences in reference | prokaryote / eukaryote | ref.fa: multiple sequences (non-plasmid); mod.fa: one or more sequences | No plasmids considered | False | Not used | No processing (files copied as-is) | Mapping-only modules | ref.fa, mod.fa (copied) |
| 5 | Fragmented reference + force_defragment_ref (unsupported workaround) | prokaryote / eukaryote | ref.fa: multiple sequences (fragmented, above limit) → merged to 1 before validation | In ref.fa: all contigs and plasmids merged into a single chromosome (*_defragmented.fasta) | True | Not used | Same for scenarios above | GFF annotation disabled | ref.fa (joined), *_defragmented_join_order.tsv, mod.fa depends on scenario |
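The "longest sequence → chromosome, remaining → plasmids" rule applied to ref.fa can be illustrated with a minimal sketch. This is not the validation package's code: the function names `parse_fasta` and `split_chromosome_and_plasmids` are hypothetical, and the sketch assumes plain, uncompressed FASTA text with unique record IDs.

```python
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence}, preserving record order."""
    records: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def split_chromosome_and_plasmids(records: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return (chromosome_id, plasmid_ids): the longest record is the chromosome."""
    if not records:
        raise ValueError("no sequences found")
    chromosome = max(records, key=lambda name: len(records[name]))
    plasmids = [name for name in records if name != chromosome]
    return chromosome, plasmids
```

With a reference holding one long record and two short ones, the long record is returned as the chromosome and the other two as plasmids, in file order.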
Important!
By default, the limit (n_sequence_limit) mentioned in the table above is set to 5 for both the reference and the assembly FASTA files. To change it, adjust n_sequence_limit in data/inputs/config.json.
Important!
Once the container is built, follow the steps below to preprocess the data with the validation package.
The input validation module preprocesses and verifies all input data to ensure it meets the required format and structure before the Nextflow pipeline is executed.
Purpose¶
The validation module ensures that:
- All input files are in the correct format
- Files are properly structured and can be parsed
- Data meets quality standards
- Files are converted to standardized formats for pipeline processing
How It Works¶
The validation process:
- Reads the configuration file from data/inputs/config.json
- Validates each input file according to its type (genome, reads, features)
- Converts files to standardized formats
- Outputs validated files to a timestamped run directory, data/valid/run_YYYYMMDD_HHMMSS/
- Writes data/valid/validated_params.json (at the top level, not inside the run directory) with validated file paths and flags for Nextflow
- Generates a log and report inside the run directory
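The output layout described above (a fresh timestamped run directory plus a single parameters file at a fixed path) can be sketched as follows. This is an illustration under assumptions, not the module's implementation: `prepare_run` and its arguments are hypothetical names, and only the directory pattern, the file name, and the `validation_timestamp` key come from this page.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def prepare_run(valid_root: Path, params: dict, now: Optional[datetime] = None) -> Path:
    """Create run_YYYYMMDD_HHMMSS/ under valid_root and write validated_params.json."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y%m%d_%H%M%S")

    # One new directory per run; exist_ok=False so an earlier run is never reused.
    run_dir = valid_root / f"run_{timestamp}"
    run_dir.mkdir(parents=True, exist_ok=False)

    # The params file sits at the top level so its path never changes between runs.
    payload = dict(params, validation_timestamp=timestamp)
    (valid_root / "validated_params.json").write_text(json.dumps(payload, indent=2))
    return run_dir
```

Keeping the parameters file outside the run directory means each new run overwrites it, while the validated files of earlier runs stay in their own directories.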
Running Validation¶
Use the validate wrapper script (preferred):
validate # default config path
validate --config path/to/config.json # custom config
# Global options can be set via CLI flags (config.json takes priority if the same
# option is also set there):
validate --threads 8
validate --validation-level strict
validate --logging-level DEBUG
validate --type eukaryote
validate --force-defragment-ref # unsupported workaround — at your own responsibility
# has no effect if force_defragment_ref is set in config.json
Configuration priority: config.json > CLI options > defaults
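The priority order can be pictured as three layers merged from weakest to strongest. The sketch below is an assumption about how such a merge might look, not the wrapper's actual code: `resolve_options` is a hypothetical name, the default values are placeholders, and unset options are assumed to be represented as `None`.

```python
from typing import Any, Dict


def resolve_options(
    defaults: Dict[str, Any],
    cli_options: Dict[str, Any],
    config_options: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge option layers so that config.json > CLI options > defaults."""
    resolved = dict(defaults)
    # CLI flags that were not passed are assumed to be None and are skipped.
    resolved.update({k: v for k, v in cli_options.items() if v is not None})
    # config.json is applied last, so it wins over both other layers.
    resolved.update({k: v for k, v in config_options.items() if v is not None})
    return resolved


# Placeholder values for illustration only.
defaults = {"threads": 1, "logging_level": "INFO", "force_defragment_ref": False}
cli = {"threads": 8, "logging_level": "DEBUG", "force_defragment_ref": True}
config = {"threads": 4, "force_defragment_ref": False}
resolved = resolve_options(defaults, cli, config)
```

Here `threads` and `force_defragment_ref` come from config.json, while `logging_level` comes from the CLI because config.json does not set it. An explicit `false` in config.json is kept, which matches the note that the CLI flag has no effect once the option is set there.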
Output¶
After successful validation:
- Validated files are placed in data/valid/run_YYYYMMDD_HHMMSS/ (a new timestamped directory per run; previous runs are preserved)
- data/valid/validated_params.json is written at the top level of data/valid/ so Nextflow can always find it at a fixed path
- If any genome exceeds n_sequence_limit or type is "eukaryote", the file is still copied but run_ref_x_mod will be set to false
- Log and report are written inside the run directory
validated_params.json¶
This file is produced by the validation step and consumed by Nextflow via -params-file. It overrides the defaults in nextflow.config and nextflow_schema.json.
Pipeline switches¶
| Parameter | Type | Description |
|---|---|---|
| run_ref_x_mod | boolean | true when both reference and modified genome validation succeeded and neither is fragmented; false when any genome exceeds n_sequence_limit or type is "eukaryote". Gates all ref-vs-mod steps. |
| run_truvari | boolean | Always false by default; can be overridden manually in data/valid/validated_params.json. |
| run_illumina | boolean | true when validated Illumina FASTQ reads are present. |
| run_nanopore | boolean | true when validated Nanopore (ONT) reads are present (FASTQ or BAM). |
| run_pacbio | boolean | true when validated PacBio reads are present (FASTQ or BAM). |
| contig_file_size | integer | Number of contig files produced by inter-genome characterisation. |
| run_vcf_annotation | boolean | true when a validated GFF feature file is available. |
| validation_timestamp | string | Timestamp of the validation run (YYYYMMDD_HHMMSS). |
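The run_ref_x_mod rule in the table can be written out as a small predicate. This sketch covers only the conditions stated in that row; it does not model the scenario-specific cases such as force_defragment_ref. The function and argument names are hypothetical, and the default limit of 5 is the one quoted earlier on this page.

```python
def derive_run_ref_x_mod(
    ref_ok: bool,
    mod_ok: bool,
    n_ref_sequences: int,
    n_mod_sequences: int,
    n_sequence_limit: int = 5,
    genome_type: str = "prokaryote",
) -> bool:
    """Return True only when ref-vs-mod steps should run, per the table row."""
    # Both genomes must have passed validation.
    if not (ref_ok and mod_ok):
        return False
    # Eukaryote mode always disables the ref-vs-mod steps.
    if genome_type == "eukaryote":
        return False
    # "Exceeds" is read as strictly greater than the limit.
    if n_ref_sequences > n_sequence_limit or n_mod_sequences > n_sequence_limit:
        return False
    return True
```

For example, a validated prokaryote pair with 1 reference sequence and 5 assembly sequences stays enabled, while 6 assembly sequences turns the switch off.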
File paths (null or empty list when absent)¶
| Parameter | Type | Description |
|---|---|---|
| ref_fasta_validated | string | Path to the validated reference genome FASTA. |
| mod_fasta_validated | string | Path to the validated modified genome FASTA. |
| ref_plasmid_fasta | string | Path to the validated reference plasmid FASTA (if present). |
| mod_plasmid_fasta | string | Path to the validated modified plasmid FASTA (if present). |
| gff | string | Path to the validated reference GFF/GFF3 file. |
| illumina_fastqs | string array | Paths to all validated Illumina FASTQ files. |
| ont_fastqs | string array | Paths to all validated Nanopore FASTQ files. |
| ont_bams | string array | Paths to all validated Nanopore BAM files (copied as-is). |
| pacbio_fastqs | string array | Paths to all validated PacBio FASTQ files. |
| pacbio_bams | string array | Paths to all validated PacBio BAM files (copied as-is). |
| contig_files | string array | Paths to contig FASTA files from inter-genome characterisation. |
BAM files: When a PacBio or ONT input is provided as BAM, it is copied to the run directory without conversion. Its path appears in pacbio_bams / ont_bams, separate from the FASTQ lists, so Nextflow can handle the two formats independently. Inter-read (R1/R2 pairing) validation is skipped when all reads for a sample are BAM.
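As an illustration of how a consumer might read these lists, the sketch below groups the read paths by platform and format and treats a null or missing entry as an empty list. The parameter names are the ones in the table above; `READ_KEYS` and `collect_read_inputs` are hypothetical helpers, not part of the pipeline.

```python
from typing import Dict, List

# Maps each platform to the validated_params.json keys listed in the table above.
READ_KEYS = {
    "illumina": {"fastq": "illumina_fastqs"},
    "nanopore": {"fastq": "ont_fastqs", "bam": "ont_bams"},
    "pacbio": {"fastq": "pacbio_fastqs", "bam": "pacbio_bams"},
}


def collect_read_inputs(params: dict) -> Dict[str, Dict[str, List[str]]]:
    """Group validated read paths by platform and format."""
    grouped: Dict[str, Dict[str, List[str]]] = {}
    for platform, keys in READ_KEYS.items():
        grouped[platform] = {
            # `or []` normalises both a missing key and an explicit null.
            fmt: list(params.get(key) or [])
            for fmt, key in keys.items()
        }
    return grouped


# Example parameters, shaped like a parsed validated_params.json.
params = {
    "illumina_fastqs": ["r1.fastq.gz", "r2.fastq.gz"],
    "ont_fastqs": None,
    "ont_bams": ["sample.bam"],
}
grouped = collect_read_inputs(params)
```

In practice the `params` dictionary would come from `json.load` on data/valid/validated_params.json; the FASTQ and BAM lists stay separate, mirroring how the file keeps them apart.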