Configuration File Guide¶

This document defines the JSON configuration file that ConfigManager uses to locate and process your files (genomes, reads, features).

Table of Contents¶

Quick Overview
Configuration Structure
File Format Support
Compression Support
Configuration Objects
Validator Settings in Config
Global Options
File-Level Settings
Examples

Quick Overview¶

Filename: config.json (standard)
Location: Same directory as your input files (paths are relative to config location)
Format: JSON (UTF-8)
Purpose: Specify reference/modified genomes, reads, and optional plasmids & features

Configuration Structure¶

Top-Level Keys¶

Key	Type	Required	Description
`ref_genome_filename`	GenomeConfig	✅	Reference genome (FASTA or GenBank)
`mod_genome_filename`	GenomeConfig	❌	Modified genome (FASTA or GenBank)
`ref_plasmid_filename`	GenomeConfig	❌	Reference plasmid (FASTA or GenBank)
`mod_plasmid_filename`	GenomeConfig	❌	Modified plasmid (FASTA or GenBank)
`reads`	List[ReadConfig]	✅	Read files (FASTQ/BAM), minimum one
`ref_feature_filename`	FeatureConfig	❌	Features for reference genome (BED, GFF, GTF)
`options`	dict	❌	Additional options (e.g., `{"threads": 8}`)

Important:

- All file paths must be relative to the config file directory
- At minimum, provide ref_genome_filename and reads
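
The two rules above can be checked before anything else runs. The following is a minimal sketch, not ConfigManager's actual code: `load_config` and `REQUIRED_KEYS` are hypothetical names introduced for illustration, and rejecting absolute paths is an assumption drawn from the "must be relative" rule.

```python
import json
from pathlib import Path

# Hypothetical constant: the two keys this guide marks as required.
REQUIRED_KEYS = ("ref_genome_filename", "reads")


def load_config(config_path):
    """Hypothetical helper: parse config.json and resolve the reference genome path."""
    config_path = Path(config_path)
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)

    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"Missing required keys: {', '.join(missing)}")
    if not config["reads"]:
        raise ValueError("'reads' must contain at least one entry")

    filename = config["ref_genome_filename"]["filename"]
    if Path(filename).is_absolute():
        # Assumption: absolute paths are rejected rather than silently accepted.
        raise ValueError(f"Path must be relative to the config file: {filename}")

    # Paths are resolved against the directory that contains the config file.
    ref_path = config_path.parent / filename
    return config, ref_path
```

Resolving against `config_path.parent` rather than the current working directory means the same config works no matter where the tool is launched from.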

File Format Support¶

Genome and Plasmid Files¶

Format	Extensions
FASTA	`.fa`, `.fasta`, `.fna`
GenBank	`.gb`, `.gbk`, `.genbank`

Feature Files¶

Format	Extensions
GFF	`.gff`, `.gff3`
GTF	`.gtf`, `.gff2`
BED	`.bed`

Read Files¶

Format	Extensions
FASTQ	`.fastq`, `.fq`
BAM	`.bam`

Compression Support¶

All file types support transparent compression. The package automatically detects and handles compressed files.

Supported Compression Formats¶

Type	Extensions	Detection	Performance
gzip	`.gz`, `.gzip`	Automatic	Faster with `pigz`
bzip2	`.bz2`, `.bzip2`	Automatic	Faster with `pbzip2`

Performance tip: Install pigz and pbzip2 for 3-4x faster compression/decompression:

# Ubuntu/Debian
sudo apt-get install pigz pbzip2

# macOS
brew install pigz pbzip2

Compression Examples¶

{
  "ref_genome_filename": {"filename": "genome.fasta.gz"},
  "reads": [
    {"filename": "reads.fastq.bz2", "ngs_type": "illumina"}
  ]
}
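
To illustrate how the format and compression tables combine, here is a minimal sketch that classifies a filename by its extensions only. It is an assumption that extensions alone are enough; this guide does not say how the package performs its automatic detection (it may inspect file contents). The names `classify`, `FORMAT_EXTENSIONS`, and `COMPRESSION_EXTENSIONS` are hypothetical.

```python
from pathlib import Path

# Extension tables copied from this guide.
COMPRESSION_EXTENSIONS = {
    ".gz": "gzip", ".gzip": "gzip",
    ".bz2": "bzip2", ".bzip2": "bzip2",
}
FORMAT_EXTENSIONS = {
    ".fa": "fasta", ".fasta": "fasta", ".fna": "fasta",
    ".gb": "genbank", ".gbk": "genbank", ".genbank": "genbank",
    ".gff": "gff", ".gff3": "gff",
    ".gtf": "gtf", ".gff2": "gtf",
    ".bed": "bed",
    ".fastq": "fastq", ".fq": "fastq",
    ".bam": "bam",
}


def classify(filename):
    """Hypothetical helper: return (file_format, compression) for a filename."""
    suffixes = [suffix.lower() for suffix in Path(filename).suffixes]
    compression = None
    # A trailing compression extension is peeled off first ...
    if suffixes and suffixes[-1] in COMPRESSION_EXTENSIONS:
        compression = COMPRESSION_EXTENSIONS[suffixes.pop()]
    # ... and the extension that remains decides the file format.
    file_format = FORMAT_EXTENSIONS.get(suffixes[-1]) if suffixes else None
    return file_format, compression
```

With this sketch, `classify("genome.fasta.gz")` yields `("fasta", "gzip")` and an unrecognized extension yields `None` for the format.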

Configuration Objects¶

GenomeConfig Object¶

Specifies a genome or plasmid file.

Structure:

{
  "filename": "path/to/genome.fasta"
}

Optional fields:

{
  "filename": "path/to/genome.fasta.gz",
  "validation_level": "trust",
  "n_sequence_limit": 10
}

Field	Type	Default	Description
`filename`	string	—	Path to genome file (relative to config)
`validation_level`	string	global / `"trust"`	Per-file validation level override
`threads`	integer	global / auto	Per-file thread count override
`n_sequence_limit`	integer	`5`	Maximum allowed number of sequences. Applies to `ref_genome_filename` and `mod_genome_filename` only — ignored with a warning on plasmids. When the genome contains more sequences than this limit, the assembly is considered too fragmented: a warning is logged, the file is copied to `data/valid/` as-is, and the pipeline will not run SyRI or ref-vs-mod comparison. Set higher for highly fragmented assemblies.

Example:

"mod_genome_filename": {
  "filename": "data/modified.fasta",
  "n_sequence_limit": 10
}
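
The limit check described for `n_sequence_limit` can be pictured as counting FASTA header lines. This is a minimal sketch under stated assumptions: it handles FASTA only (not GenBank), picks the decompressor from the file extension, and uses hypothetical helper names (`open_text`, `exceeds_sequence_limit`) that are not part of the package.

```python
import bz2
import gzip
from pathlib import Path


def open_text(path):
    """Hypothetical helper: open a plain, gzip, or bzip2 file in text mode."""
    suffix = Path(path).suffix.lower()
    if suffix in (".gz", ".gzip"):
        return gzip.open(path, "rt", encoding="utf-8")
    if suffix in (".bz2", ".bzip2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def exceeds_sequence_limit(fasta_path, n_sequence_limit=5):
    """Return True when the FASTA holds more records than the limit allows."""
    with open_text(fasta_path) as handle:
        # Every FASTA record starts with a '>' header line.
        n_sequences = sum(1 for line in handle if line.startswith(">"))
    return n_sequences > n_sequence_limit
```

A genome with exactly `n_sequence_limit` sequences passes in this sketch; only a strictly larger count is treated as too fragmented, matching the "more sequences than this limit" wording above.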

ReadConfig Object¶

Specifies sequencing read files. Supports individual files or directories.

Option 1: Individual Files¶

"reads": [
  {
    "filename": "samples/reads_R1.fastq.gz",
    "ngs_type": "illumina"
  },
  {
    "filename": "samples/reads_R2.fastq.gz",
    "ngs_type": "illumina"
  }
]

Option 2: Directory of Files¶

All files in the directory inherit the same settings. Each file becomes a separate read entry. This works for all NGS types (illumina, ont, pacbio).

"reads": [
  {
    "directory": "ont_reads/",
    "ngs_type": "ont",
    "validation_level": "trust"
  }
]
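
The directory form can be thought of as shorthand that expands into one entry per file. Below is a minimal sketch of that expansion; `expand_read_entry` is a hypothetical name, and the alphabetical ordering and the lack of any extension filtering are assumptions that this guide does not confirm.

```python
from pathlib import Path


def expand_read_entry(entry, base_dir):
    """Hypothetical helper: turn one 'directory' entry into one entry per file."""
    if "directory" not in entry:
        return [dict(entry)]

    base_dir = Path(base_dir)
    directory = base_dir / entry["directory"]
    # Every setting except 'directory' is inherited by each expanded entry.
    shared = {key: value for key, value in entry.items() if key != "directory"}

    expanded = []
    for path in sorted(directory.iterdir()):  # assumption: alphabetical order
        if path.is_file():
            relative = path.relative_to(base_dir).as_posix()
            expanded.append({"filename": relative, **shared})
    return expanded
```

An entry that already uses `filename` is returned unchanged, so the same function can be mapped over the whole `reads` list.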

NGS Type Values¶

Value	Description	Typical Read Length
`"illumina"`	Illumina short reads (SE or PE)	50-300 bp
`"ont"`	Oxford Nanopore long reads	1-100+ kb
`"pacbio"`	PacBio long reads	10-50+ kb

FeatureConfig Object¶

Specifies feature annotation files (BED, GFF, GTF).

Structure:

{
  "filename": "path/to/features.gff"
}

Optional fields (validator settings):

{
  "filename": "path/to/features.gff",
  "validation_level": "strict"
}

Example:

"ref_feature_filename": {"filename": "annotations/reference.gff"}

Validator Settings in Config¶

You can specify validator-specific settings directly in your config file at two levels:

Global options (applies to ALL files)
File-level settings (applies to specific files)

These settings customize validation behavior without modifying code.

Global Options (`options` field)¶

Allowed global options:

Option	Type	Default	Description
`threads`	integer / `null`	auto-detect	Number of threads. `null` or omit = auto-detect from CPU cores. System warns if value exceeds available cores.
`validation_level`	string	`"trust"`	`"strict"`, `"trust"`, or `"minimal"`
`logging_level`	string	`"INFO"`	`"DEBUG"`, `"INFO"`, `"WARNING"`, or `"ERROR"`
`type`	string	`"prokaryote"`	`"prokaryote"` or `"eukaryote"`. Eukaryote skips inter-genome comparison.
`force_defragment_ref`	boolean	`false`	Unsupported workaround — use at your own risk. Merges all reference contigs into one sequence before validation. Use only when the reference is too fragmented for any workflow. All downstream results (alignment, variant calling, feature mapping) may be incorrect or meaningless. See warning below.

⚠ Warning: force_defragment_ref

This option is an unsupported workaround; use it at your own risk. The EFSA pipeline does not support fragmented reference genomes. Merging contigs artificially alters the coordinate space of the reference, so all downstream results (inter-genome alignment, variant calling, feature coordinate mapping) may be incorrect or meaningless. Do not use these results for regulatory submissions or biological conclusions without expert review. Obtain a properly assembled reference genome instead.

When force_defragment_ref is set to true in config.json, it takes priority over the --force-defragment-ref CLI flag. When set to false in config.json, the CLI flag is ignored even if provided.

Join-order TSV output¶

When defragmentation runs, a TSV file is written alongside the merged FASTA:

data/outputs/tables/tsv/<basename>_defragmented_join_order.tsv

Column	Description
`seq_id`	Original contig ID from the source FASTA
`length`	Length of that contig in base pairs
`start`	1-based start position of that contig in the merged sequence

Merge order is determined by the order sequences appear in the input FASTA file — the first record becomes the first segment, the second becomes the second, and so on. No sorting or reordering is applied. The start column encodes this order unambiguously: each contig begins immediately after the previous one ends, with no gaps or separators inserted between them.

This TSV is the only record of the original contig identities and their positions in the merged sequence. Keep it if you need to map coordinates back to the source contigs.

Example:

seq_id	length	start
contig_1	450000	1
contig_2	320000	450001
contig_3	180000	770001
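
Mapping a position in the merged sequence back to its source contig follows directly from the `start` and `length` columns. The sketch below assumes the TSV layout shown above; `read_join_order` and `to_contig_coordinate` are hypothetical helper names, not functions shipped with the package.

```python
import csv
import io
from bisect import bisect_right


def read_join_order(tsv_text):
    """Parse join-order TSV text into (seq_id, length, start) tuples."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row["seq_id"], int(row["length"]), int(row["start"])) for row in reader]


def to_contig_coordinate(segments, merged_pos):
    """Map a 1-based position in the merged sequence to (seq_id, 1-based offset)."""
    starts = [start for _, _, start in segments]
    # Find the last contig whose start is <= merged_pos.
    index = bisect_right(starts, merged_pos) - 1
    if index < 0:
        raise ValueError(f"Position {merged_pos} is before the first contig")
    seq_id, length, start = segments[index]
    offset = merged_pos - start + 1
    if offset > length:
        raise ValueError(f"Position {merged_pos} is past the end of the merged sequence")
    return seq_id, offset
```

Using the example table, merged position 450001 maps to position 1 of `contig_2`, because each contig begins immediately after the previous one ends.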

Example config:

{
  "ref_genome_filename": {"filename": "genome.fasta"},
  "reads": [
    {"filename": "reads.fastq", "ngs_type": "illumina"}
  ],
  "options": {
    "threads": 8,
    "validation_level": "trust",
    "logging_level": "DEBUG",
    "type": "prokaryote",
    "force_defragment_ref": false
  }
}

Result: ALL files will use the specified options. Any option not set here falls back to its default.

Option priority¶

Options can be set from three sources. The first matching source wins:

config.json "options"  →  CLI flags (validation.sh)  →  built-in defaults

A value in config.json always takes priority over a CLI flag, and a CLI flag takes priority over the built-in default. This applies to all options including force_defragment_ref: if the config sets it to false, passing --force-defragment-ref on the command line has no effect.
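
The precedence rule can be sketched as a three-step lookup per option. This is an illustration only: `resolve_options` and `BUILT_IN_DEFAULTS` are hypothetical names, the CLI flags are modeled as a plain dict, and treating any key present in config.json (even one set to `null`) as "set" is an assumption.

```python
# Defaults taken from the Global Options table; None stands for auto-detect.
BUILT_IN_DEFAULTS = {
    "threads": None,
    "validation_level": "trust",
    "logging_level": "INFO",
    "type": "prokaryote",
    "force_defragment_ref": False,
}


def resolve_options(config_options, cli_flags):
    """Hypothetical helper: config.json first, then CLI flags, then defaults."""
    resolved = {}
    for name, default in BUILT_IN_DEFAULTS.items():
        if name in config_options:
            resolved[name] = config_options[name]  # config.json always wins
        elif name in cli_flags:
            resolved[name] = cli_flags[name]       # CLI used only when config is silent
        else:
            resolved[name] = default
    return resolved
```

For instance, with `"force_defragment_ref": false` in the config and the flag passed on the command line, the resolved value stays `False`, while a `threads` value given only on the command line is still honored.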

Validation:

- Invalid option names → ConfigurationError (e.g., "abc" not allowed)
- Invalid threads → ConfigurationError (e.g., negative numbers)
- Invalid validation_level → ConfigurationError (e.g., "invalid_level")
- Invalid logging_level → ConfigurationError (e.g., "verbose", must be DEBUG/INFO/WARNING/ERROR)
- Invalid type → ConfigurationError (e.g., "bacteria", must be "prokaryote" or "eukaryote")
- Invalid force_defragment_ref → ConfigurationError (must be true or false, not a string)
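
The validation rules above amount to a whitelist of option names plus a per-option value check. The following is a minimal sketch, not the package's implementation: the `ConfigurationError` class here is a local stand-in (the real import path is not shown in this guide), `validate_options` is a hypothetical name, and rejecting a thread count of zero is an assumption.

```python
class ConfigurationError(ValueError):
    """Local stand-in for the package's ConfigurationError."""


# Allowed string values, copied from the Global Options table.
ALLOWED_VALUES = {
    "validation_level": {"strict", "trust", "minimal"},
    "logging_level": {"DEBUG", "INFO", "WARNING", "ERROR"},
    "type": {"prokaryote", "eukaryote"},
}
ALLOWED_OPTIONS = set(ALLOWED_VALUES) | {"threads", "force_defragment_ref"}


def validate_options(options):
    """Hypothetical helper: raise ConfigurationError on the first invalid option."""
    for name, value in options.items():
        if name not in ALLOWED_OPTIONS:
            raise ConfigurationError(f"Option {name!r} is not allowed")
        if name in ALLOWED_VALUES:
            if not isinstance(value, str) or value not in ALLOWED_VALUES[name]:
                raise ConfigurationError(f"Invalid {name}: {value!r}")
        elif name == "threads" and value is not None:
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                raise ConfigurationError(f"Invalid threads: {value!r}")
        elif name == "force_defragment_ref" and not isinstance(value, bool):
            raise ConfigurationError(f"Invalid force_defragment_ref: {value!r}")
```

A JSON `null` for `threads` arrives as `None` and passes, which matches the auto-detect behavior described in the options table.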

File-Level Settings¶

Override global options for specific files by adding settings to individual file entries:

- validation_level: "strict", "trust", or "minimal" (overrides global)
- threads: Number of threads for compression (int, overrides global)
- n_sequence_limit: Maximum number of sequences allowed in a genome file (int, default: 5). Applies to ref_genome_filename and mod_genome_filename only; ignored with a warning on plasmids.

Warnings: When a file-level setting overrides a global option, a WARNING is logged:

WARNING: File-level setting 'validation_level=strict' overrides global option 'validation_level=trust'
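
The override behavior can be sketched as a merge of global options and file-level settings that logs when both are present. This is an assumption-laden illustration: `effective_settings` and the logger name `config_sketch` are hypothetical, and whether the real package also warns when the two values are identical is not stated in this guide (this sketch warns only when they differ).

```python
import logging

logger = logging.getLogger("config_sketch")  # hypothetical logger name

# Settings that exist at both the global and the file level.
OVERRIDABLE = ("validation_level", "threads")


def effective_settings(file_entry, global_options):
    """Hypothetical helper: merge global options with one file entry's overrides."""
    settings = {}
    for name in OVERRIDABLE:
        if name in file_entry:
            file_value = file_entry[name]
            if name in global_options and global_options[name] != file_value:
                logger.warning(
                    "File-level setting '%s=%s' overrides global option '%s=%s'",
                    name, file_value, name, global_options[name],
                )
            settings[name] = file_value
        elif name in global_options:
            settings[name] = global_options[name]
    return settings
```

Settings absent from both levels are simply left out of the result, so the caller can fall back to built-in defaults.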

Examples¶

Minimal Configuration¶

Simplest valid configuration with required fields only:

{
  "ref_genome_filename": {"filename": "ref.fasta"},
  "reads": [
    {"filename": "reads.fastq", "ngs_type": "illumina"}
  ]
}

Full Configuration¶

Complete example with all optional fields and global options:

{
  "ref_genome_filename": {
    "filename": "ref.gbk",
    "validation_level": "strict",
    "threads": 8,
    "n_sequence_limit": 5
  },
  "mod_genome_filename": {
    "filename": "mod.fasta.gz",
    "validation_level": "strict",
    "threads": 8,
    "n_sequence_limit": 5
  },
  "ref_plasmid_filename": {
    "filename": "plasmid_ref.gbk",
    "validation_level": "strict",
    "threads": 8
  },
  "mod_plasmid_filename": {
    "filename": "plasmid_mod.fasta",
    "validation_level": "strict",
    "threads": 8
  },
  "reads": [
    {
      "filename": "illumina_R1.fastq.gz",
      "ngs_type": "illumina",
      "validation_level": "strict",
      "threads": 8
    },
    {
      "filename": "illumina_R2.fastq.gz",
      "ngs_type": "illumina",
      "validation_level": "strict",
      "threads": 8
    },
    {
      "directory": "ont_reads/",
      "ngs_type": "ont",
      "validation_level": "strict",
      "threads": 8
    }
  ],
  "ref_feature_filename": {
    "filename": "features_ref.gff3",
    "validation_level": "strict",
    "threads": 8
  },
  "options": {
    "threads": 8,
    "validation_level": "strict",
    "logging_level": "INFO",
    "type": "prokaryote"
  }
}
