Configuration File Guide
This document defines the JSON configuration file that ConfigManager uses to locate and process your files (genomes, reads, features).
Table of Contents
- Quick Overview
- Configuration Structure
- File Format Support
- Compression Support
- Configuration Objects
- Validator Settings in Config
- Global Options
- File-Level Settings
- Examples
Quick Overview
- Filename: config.json (standard)
- Location: Same directory as your input files (paths are relative to the config location)
- Format: JSON (UTF-8)
- Purpose: Specify reference/modified genomes, reads, and optional plasmids & features
Configuration Structure
Top-Level Keys
| Key | Type | Required | Description |
|---|---|---|---|
| ref_genome_filename | GenomeConfig | ✅ | Reference genome (FASTA or GenBank) |
| mod_genome_filename | GenomeConfig | ❌ | Modified genome (FASTA or GenBank) |
| ref_plasmid_filename | GenomeConfig | ❌ | Reference plasmid (FASTA or GenBank) |
| mod_plasmid_filename | GenomeConfig | ❌ | Modified plasmid (FASTA or GenBank) |
| reads | List[ReadConfig] | ✅ | Read files (FASTQ/BAM), minimum one |
| ref_feature_filename | FeatureConfig | ❌ | Features for reference genome (BED, GFF, GTF) |
| options | dict | ❌ | Additional options (e.g., {"threads": 8}) |
Important:
- All file paths must be relative to the config file directory
- At minimum, provide ref_genome_filename and reads
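The two rules above can be sketched in Python. This is a minimal illustration, not ConfigManager's actual code: `load_config` and `REQUIRED_KEYS` are hypothetical names, and raising `ValueError` is an assumption about error handling.

```python
import json
from pathlib import Path

# Hypothetical constant: the two keys the guide marks as required.
REQUIRED_KEYS = ("ref_genome_filename", "reads")


def load_config(config_path):
    """Load a config file and resolve the reference genome path.

    Paths are resolved against the directory that holds the config file,
    not against the current working directory.
    """
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))

    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"Missing required keys: {', '.join(missing)}")
    if not config["reads"]:
        raise ValueError("'reads' must contain at least one entry")

    base_dir = config_path.parent
    ref_path = base_dir / config["ref_genome_filename"]["filename"]
    return config, ref_path
```

Calling `load_config("project/config.json")` on a config whose reference genome is `ref.fasta` would return the path `project/ref.fasta`.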
File Format Support
Genome and Plasmid Files
| Format | Extensions |
|---|---|
| FASTA | .fa, .fasta, .fna |
| GenBank | .gb, .gbk, .genbank |
Feature Files
| Format | Extensions |
|---|---|
| GFF | .gff, .gff3 |
| GTF | .gtf, .gff2 |
| BED | .bed |
Read Files
| Format | Extensions |
|---|---|
| FASTQ | .fastq, .fq |
| BAM | .bam |
Compression Support
All file types support transparent compression. The package automatically detects and handles compressed files.
Supported Compression Formats
| Type | Extensions | Detection | Performance |
|---|---|---|---|
| gzip | .gz, .gzip | Automatic | Faster with pigz |
| bzip2 | .bz2, .bzip2 | Automatic | Faster with pbzip2 |
Performance tip: Install pigz and pbzip2 for 3-4x faster compression/decompression.
Compression Examples
{
  "ref_genome_filename": {"filename": "genome.fasta.gz"},
  "reads": [
    {"filename": "reads.fastq.bz2", "ngs_type": "illumina"}
  ]
}
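The extension tables above can be combined into a small lookup. The sketch below classifies a filename purely by its suffixes; `detect_format` and both dictionaries are hypothetical names, and the package's own automatic detection may work differently (for example by inspecting file contents).

```python
from pathlib import Path

# Extension maps transcribed from the tables in this guide.
COMPRESSION_EXTENSIONS = {
    ".gz": "gzip", ".gzip": "gzip",
    ".bz2": "bzip2", ".bzip2": "bzip2",
}
FORMAT_EXTENSIONS = {
    ".fa": "fasta", ".fasta": "fasta", ".fna": "fasta",
    ".gb": "genbank", ".gbk": "genbank", ".genbank": "genbank",
    ".gff": "gff", ".gff3": "gff",
    ".gtf": "gtf", ".gff2": "gtf",
    ".bed": "bed",
    ".fastq": "fastq", ".fq": "fastq",
    ".bam": "bam",
}


def detect_format(filename):
    """Return (file_format, compression); compression is None if absent."""
    suffixes = [suffix.lower() for suffix in Path(filename).suffixes]
    compression = None
    # A trailing compression suffix is peeled off before the format lookup.
    if suffixes and suffixes[-1] in COMPRESSION_EXTENSIONS:
        compression = COMPRESSION_EXTENSIONS[suffixes.pop()]
    if not suffixes or suffixes[-1] not in FORMAT_EXTENSIONS:
        raise ValueError(f"Unrecognized file extension: {filename}")
    return FORMAT_EXTENSIONS[suffixes[-1]], compression


print(detect_format("genome.fasta.gz"))  # ('fasta', 'gzip')
print(detect_format("reads.fastq.bz2"))  # ('fastq', 'bzip2')
```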
Configuration Objects
GenomeConfig Object
Specifies a genome or plasmid file.
Structure:
Optional fields:
| Field | Type | Default | Description |
|---|---|---|---|
| filename | string | — | Path to genome file (relative to config) |
| validation_level | string | global / "trust" | Per-file validation level override |
| threads | integer | global / auto | Per-file thread count override |
| n_sequence_limit | integer | 5 | Maximum allowed number of sequences. Applies to ref_genome_filename and mod_genome_filename only; ignored with a warning on plasmids. When the genome contains more sequences than this limit, the assembly is considered too fragmented: a warning is logged, the file is copied to data/valid/ as-is, and the pipeline will not run SyRI or ref-vs-mod comparison. Set higher for highly fragmented assemblies. |
Example:
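As a hedged illustration, the sketch below builds a GenomeConfig entry from the fields in the table and shows one way the n_sequence_limit check could be expressed. The values are illustrative, and `count_fasta_records` and `is_too_fragmented` are hypothetical helpers, not part of the package; counting `>` header lines only applies to FASTA input.

```python
import json

# Illustrative values; only the field names come from the table above.
genome_entry = {
    "filename": "genome.fasta",
    "validation_level": "strict",
    "threads": 4,
    "n_sequence_limit": 10,
}


def count_fasta_records(fasta_text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def is_too_fragmented(fasta_text, n_sequence_limit=5):
    """Return True when the file holds more sequences than the limit."""
    return count_fasta_records(fasta_text) > n_sequence_limit


# The entry as it would appear under a top-level key in config.json.
print(json.dumps({"ref_genome_filename": genome_entry}, indent=2))
```

With the default limit of 5, a six-contig FASTA would be flagged as too fragmented, while the same file passes once n_sequence_limit is raised to 10.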
ReadConfig Object
Specifies sequencing read files. Supports individual files or directories.
Option 1: Individual Files
"reads": [
  {
    "filename": "samples/reads_R1.fastq.gz",
    "ngs_type": "illumina"
  },
  {
    "filename": "samples/reads_R2.fastq.gz",
    "ngs_type": "illumina"
  }
]
Option 2: Directory of Files
All files in the directory inherit the same settings. Each file becomes a separate read entry. This works for all NGS types (illumina, ont, pacbio).
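A directory entry uses the directory key in place of filename, as in the ont_reads/ entry of the Full Configuration example. The sketch below shows how such an entry could be expanded into one entry per file. `expand_read_entry` is a hypothetical helper, and the alphabetical ordering and the absence of extension filtering are assumptions.

```python
from pathlib import Path


def expand_read_entry(entry, base_dir):
    """Expand a {"directory": ...} read entry into one entry per file.

    Every generated entry inherits the settings of the directory entry
    (ngs_type, validation_level, threads, ...).
    """
    if "directory" not in entry:
        return [dict(entry)]

    base = Path(base_dir)
    directory = base / entry["directory"]
    shared = {key: value for key, value in entry.items() if key != "directory"}

    expanded = []
    # Sorted for a deterministic order (an assumption of this sketch).
    for path in sorted(p for p in directory.iterdir() if p.is_file()):
        relative = path.relative_to(base).as_posix()
        expanded.append({"filename": relative, **shared})
    return expanded
```

For a directory `ont_reads/` holding `a.fastq` and `b.fastq`, the entry `{"directory": "ont_reads/", "ngs_type": "ont"}` would expand to two entries with filenames `ont_reads/a.fastq` and `ont_reads/b.fastq`, each carrying `"ngs_type": "ont"`.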
NGS Type Values
| Value | Description | Typical Read Length |
|---|---|---|
| "illumina" | Illumina short reads (SE or PE) | 50-300 bp |
| "ont" | Oxford Nanopore long reads | 1-100+ kb |
| "pacbio" | PacBio long reads | 10-50+ kb |
FeatureConfig Object
Specifies feature annotation files (BED, GFF, GTF).
Structure:
Optional fields (validator settings):
Example:
Validator Settings in Config
You can specify validator-specific settings directly in your config file at two levels:
- Global options (applies to ALL files)
- File-level settings (applies to specific files)
These settings customize validation behavior without modifying code.
Global Options (options field)
Allowed global options:
| Option | Type | Default | Description |
|---|---|---|---|
| threads | integer / null | auto-detect | Number of threads. null or omit = auto-detect from CPU cores. System warns if value exceeds available cores. |
| validation_level | string | "trust" | "strict", "trust", or "minimal" |
| logging_level | string | "INFO" | "DEBUG", "INFO", "WARNING", or "ERROR" |
| type | string | "prokaryote" | "prokaryote" or "eukaryote". Eukaryote skips inter-genome comparison. |
| force_defragment_ref | boolean | false | Unsupported workaround; use at your own risk. Merges all reference contigs into one sequence before validation. Use only when the reference is too fragmented for any workflow. All downstream results (alignment, variant calling, feature mapping) may be incorrect or meaningless. See warning below. |
⚠ Warning: force_defragment_ref
This option is an unsupported workaround, and you use it entirely at your own risk. The EFSA pipeline does not support fragmented reference genomes. Merging contigs artificially alters the coordinate space of the reference, so all downstream results (inter-genome alignment, variant calling, feature coordinate mapping) may be incorrect or meaningless. Do not use these results for regulatory submissions or biological conclusions without expert review. Obtain a properly assembled reference genome instead.
When force_defragment_ref is set to true in config.json, it takes priority over the --force-defragment-ref CLI flag. When set to false in config.json, the CLI flag is ignored even if provided.
Join-order TSV output
When defragmentation runs, a TSV file is written alongside the merged FASTA:
| Column | Description |
|---|---|
| seq_id | Original contig ID from the source FASTA |
| length | Length of that contig in base pairs |
| start | 1-based start position of that contig in the merged sequence |
Merge order is determined by the order sequences appear in the input FASTA file — the first record becomes the first segment, the second becomes the second, and so on. No sorting or reordering is applied. The start column encodes this order unambiguously: each contig begins immediately after the previous one ends, with no gaps or separators inserted between them.
This TSV is the only record of the original contig identities and their positions in the merged sequence. Keep it if you need to map coordinates back to the source contigs.
Example:
| seq_id | length | start |
|---|---|---|
| contig_1 | 450000 | 1 |
| contig_2 | 320000 | 450001 |
| contig_3 | 180000 | 770001 |
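Mapping a position in the merged sequence back to its source contig follows directly from the start and length columns. The sketch below uses the example rows above. `load_join_order` and `to_contig_coordinate` are hypothetical helpers, and the presence of a header row in the TSV is an assumption.

```python
import csv
import io
from bisect import bisect_right

# Example rows from the table above; the header row is an assumption.
JOIN_ORDER_TSV = (
    "seq_id\tlength\tstart\n"
    "contig_1\t450000\t1\n"
    "contig_2\t320000\t450001\n"
    "contig_3\t180000\t770001\n"
)


def load_join_order(tsv_text):
    """Parse the join-order TSV into (seq_id, length, start) tuples."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row["seq_id"], int(row["length"]), int(row["start"])) for row in rows]


def to_contig_coordinate(segments, merged_pos):
    """Map a 1-based merged position to (seq_id, 1-based contig position)."""
    starts = [start for _, _, start in segments]
    # Index of the last segment whose start is <= merged_pos.
    index = bisect_right(starts, merged_pos) - 1
    if index < 0:
        raise ValueError(f"Position {merged_pos} is before the first contig")
    seq_id, length, start = segments[index]
    offset = merged_pos - start
    if offset >= length:
        raise ValueError(f"Position {merged_pos} is past the merged sequence")
    return seq_id, offset + 1


segments = load_join_order(JOIN_ORDER_TSV)
print(to_contig_coordinate(segments, 450001))  # ('contig_2', 1)
```

Because contigs are joined without gaps, each start equals the previous start plus the previous length, so a binary search over the start column is enough to locate the source contig.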
Example config:
{
  "ref_genome_filename": {"filename": "genome.fasta"},
  "reads": [
    {"filename": "reads.fastq", "ngs_type": "illumina"}
  ],
  "options": {
    "threads": 8,
    "validation_level": "trust",
    "logging_level": "DEBUG",
    "type": "prokaryote",
    "force_defragment_ref": false
  }
}
Result: ALL files will use the specified options. Any option not set here falls back to its default.
Option priority
Options can be set from three sources. The first matching source wins:
1. A value in config.json
2. A CLI flag
3. The built-in default
A value in config.json always takes priority over a CLI flag, and a CLI flag takes priority over the built-in default. This applies to all options including force_defragment_ref: if the config sets it to false, passing --force-defragment-ref on the command line has no effect.
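The priority rule can be sketched as a simple lookup chain. `resolve_option` and the `DEFAULTS` dictionary are hypothetical names; the default values are taken from the global options table, with `None` standing in for thread auto-detection. The sketch treats any key present in the config as set, which is an assumption for values such as a null threads.

```python
# Defaults from the global options table; None means "auto-detect".
DEFAULTS = {
    "threads": None,
    "validation_level": "trust",
    "logging_level": "INFO",
    "type": "prokaryote",
    "force_defragment_ref": False,
}


def resolve_option(name, config_options, cli_flags, defaults=DEFAULTS):
    """Return the effective value: config.json, then CLI flag, then default."""
    if name in config_options:
        return config_options[name]
    if name in cli_flags:
        return cli_flags[name]
    return defaults[name]


# The config sets the option to false, so the CLI flag has no effect.
config_options = {"force_defragment_ref": False}
cli_flags = {"force_defragment_ref": True}
print(resolve_option("force_defragment_ref", config_options, cli_flags))  # False
```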
Validation:
- Invalid option names → ConfigurationError (e.g., "abc" not allowed)
- Invalid threads → ConfigurationError (e.g., negative numbers)
- Invalid validation_level → ConfigurationError (e.g., "invalid_level")
- Invalid logging_level → ConfigurationError (e.g., "verbose", must be DEBUG/INFO/WARNING/ERROR)
- Invalid type → ConfigurationError (e.g., "bacteria", must be "prokaryote" or "eukaryote")
- Invalid force_defragment_ref → ConfigurationError (must be true or false, not a string)
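The checks listed above could be expressed roughly as follows. This is a sketch, not the package's validator: the `ConfigurationError` class here is a local stand-in, `validate_options` is a hypothetical name, and rejecting a thread count of zero is an assumption (the guide only names negative numbers).

```python
class ConfigurationError(Exception):
    """Local stand-in for the package's ConfigurationError."""


ALLOWED_VALUES = {
    "validation_level": {"strict", "trust", "minimal"},
    "logging_level": {"DEBUG", "INFO", "WARNING", "ERROR"},
    "type": {"prokaryote", "eukaryote"},
}
ALLOWED_OPTIONS = set(ALLOWED_VALUES) | {"threads", "force_defragment_ref"}


def validate_options(options):
    """Raise ConfigurationError on the first invalid global option."""
    for name, value in options.items():
        if name not in ALLOWED_OPTIONS:
            raise ConfigurationError(f"Option '{name}' is not allowed")
        if name in ALLOWED_VALUES and value not in ALLOWED_VALUES[name]:
            raise ConfigurationError(f"Invalid {name}: {value!r}")

    threads = options.get("threads")
    if threads is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
            raise ConfigurationError(f"Invalid threads: {threads!r}")

    flag = options.get("force_defragment_ref", False)
    if not isinstance(flag, bool):
        raise ConfigurationError("force_defragment_ref must be true or false")
```

Passing `{"type": "bacteria"}` or `{"force_defragment_ref": "true"}` to this sketch raises the stand-in ConfigurationError, while `{"threads": None}` is accepted as a request for auto-detection.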
File-Level Settings
Override global options for specific files by adding settings to individual file entries:
- validation_level: "strict", "trust", or "minimal" (overrides global)
- threads: Number of threads for compression (int, overrides global)
- n_sequence_limit: Maximum number of sequences allowed in a genome file (int, default: 5). Applies to ref_genome_filename and mod_genome_filename only; ignored with a warning on plasmids.
Warnings: When a file-level setting overrides a global option, a WARNING is logged:
WARNING: File-level setting 'validation_level=strict' overrides global option 'validation_level=trust'
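The override behavior can be sketched as a merge of file-level settings over global options. `effective_settings` and the logger name are hypothetical, and logging the warning only when the two values differ is an assumption; the message text follows the example above.

```python
import logging

logger = logging.getLogger("config_sketch")  # hypothetical logger name

# Settings that exist at both the global and the file level.
OVERRIDABLE = ("validation_level", "threads")


def effective_settings(file_entry, global_options):
    """Merge file-level settings over global options for one file entry."""
    settings = {}
    for name in OVERRIDABLE:
        if name in file_entry:
            settings[name] = file_entry[name]
            if name in global_options and global_options[name] != file_entry[name]:
                logger.warning(
                    "File-level setting '%s=%s' overrides global option '%s=%s'",
                    name, file_entry[name], name, global_options[name],
                )
        elif name in global_options:
            settings[name] = global_options[name]
    return settings


entry = {"filename": "ref.fasta", "validation_level": "strict"}
options = {"validation_level": "trust", "threads": 8}
print(effective_settings(entry, options))
# {'validation_level': 'strict', 'threads': 8}
```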
Examples
Minimal Configuration
Simplest valid configuration with required fields only:
{
  "ref_genome_filename": {"filename": "ref.fasta"},
  "reads": [
    {"filename": "reads.fastq", "ngs_type": "illumina"}
  ]
}
Full Configuration
Complete example with all optional fields and global options:
{
  "ref_genome_filename": {
    "filename": "ref.gbk",
    "validation_level": "strict",
    "threads": 8,
    "n_sequence_limit": 5
  },
  "mod_genome_filename": {
    "filename": "mod.fasta.gz",
    "validation_level": "strict",
    "threads": 8,
    "n_sequence_limit": 5
  },
  "ref_plasmid_filename": {
    "filename": "plasmid_ref.gbk",
    "validation_level": "strict",
    "threads": 8
  },
  "mod_plasmid_filename": {
    "filename": "plasmid_mod.fasta",
    "validation_level": "strict",
    "threads": 8
  },
  "reads": [
    {
      "filename": "illumina_R1.fastq.gz",
      "ngs_type": "illumina",
      "validation_level": "strict",
      "threads": 8
    },
    {
      "filename": "illumina_R2.fastq.gz",
      "ngs_type": "illumina",
      "validation_level": "strict",
      "threads": 8
    },
    {
      "directory": "ont_reads/",
      "ngs_type": "ont",
      "validation_level": "strict",
      "threads": 8
    }
  ],
  "ref_feature_filename": {
    "filename": "features_ref.gff3",
    "validation_level": "strict",
    "threads": 8
  },
  "options": {
    "threads": 8,
    "validation_level": "strict",
    "logging_level": "INFO",
    "type": "prokaryote"
  }
}