
Configuration File Guide

This document defines the JSON configuration file that ConfigManager uses to locate and process your files (genomes, reads, features).

Quick Overview

  • Filename: config.json (standard)
  • Location: Same directory as your input files (paths are relative to config location)
  • Format: JSON (UTF-8)
  • Purpose: Specify reference/modified genomes, reads, and optional plasmids & features

Configuration Structure

Top-Level Keys

| Key | Type | Required | Description |
|---|---|---|---|
| ref_genome_filename | GenomeConfig | Yes | Reference genome (FASTA or GenBank) |
| mod_genome_filename | GenomeConfig | No | Modified genome (FASTA or GenBank) |
| ref_plasmid_filename | GenomeConfig | No | Reference plasmid (FASTA or GenBank) |
| mod_plasmid_filename | GenomeConfig | No | Modified plasmid (FASTA or GenBank) |
| reads | List[ReadConfig] | Yes | Read files (FASTQ/BAM), minimum one |
| ref_feature_filename | FeatureConfig | No | Features for reference genome (BED, GFF, GTF) |
| options | dict | No | Additional options (e.g., {"threads": 8}) |

Important:

  • All file paths must be relative to the config file directory.
  • At minimum, provide ref_genome_filename and reads.
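The two rules above can be checked before a run. The following is a minimal sketch, not ConfigManager's own validation: the helper name `check_minimum_config` is hypothetical, and it assumes every file entry is an object carrying a `filename` or `directory` field as described in this guide.

```python
import json
from pathlib import PurePosixPath

REQUIRED_KEYS = ("ref_genome_filename", "reads")


def check_minimum_config(config: dict) -> list:
    """Return a list of problems; an empty list means the basic rules hold."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing required key: {key}")
    if "reads" in config and not config["reads"]:
        problems.append("reads must contain at least one entry")

    # Collect every file entry: the single *_filename objects plus the reads list.
    entries = [v for k, v in config.items() if k.endswith("_filename")]
    entries += config.get("reads", [])
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        path = entry.get("filename") or entry.get("directory")
        # Only POSIX-style absolute paths are detected in this sketch.
        if path and PurePosixPath(path).is_absolute():
            problems.append(f"path must be relative to the config file: {path}")
    return problems


config = json.loads("""
{
  "ref_genome_filename": {"filename": "/data/ref.fasta"},
  "reads": []
}
""")
for problem in check_minimum_config(config):
    print(problem)
```

Returning a list of problems instead of raising on the first one lets a user fix several mistakes in a single pass.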


File Format Support

Genome and Plasmid Files

| Format | Extensions |
|---|---|
| FASTA | .fa, .fasta, .fna |
| GenBank | .gb, .gbk, .genbank |

Feature Files

| Format | Extensions |
|---|---|
| GFF | .gff, .gff3 |
| GTF | .gtf, .gff2 |
| BED | .bed |

Read Files

| Format | Extensions |
|---|---|
| FASTQ | .fastq, .fq |
| BAM | .bam |

Compression Support

All file types support transparent compression. The package automatically detects and handles compressed files.

Supported Compression Formats

| Type | Extensions | Detection | Performance |
|---|---|---|---|
| gzip | .gz, .gzip | Automatic | Faster with pigz |
| bzip2 | .bz2, .bzip2 | Automatic | Faster with pbzip2 |

Performance tip: Install pigz and pbzip2 for 3-4x faster compression/decompression:

# Ubuntu/Debian
sudo apt-get install pigz pbzip2

# macOS
brew install pigz pbzip2

Compression Examples

{
  "ref_genome_filename": {"filename": "genome.fasta.gz"},
  "reads": [
    {"filename": "reads.fastq.bz2", "ngs_type": "illumina"}
  ]
}
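To see how the format and compression tables combine, the sketch below classifies a filename by its extensions. Detecting by extension is an assumption made for illustration; the package may inspect file contents instead. The function name `classify` is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

COMPRESSION = {".gz": "gzip", ".gzip": "gzip", ".bz2": "bzip2", ".bzip2": "bzip2"}
FORMATS = {
    ".fa": "FASTA", ".fasta": "FASTA", ".fna": "FASTA",
    ".gb": "GenBank", ".gbk": "GenBank", ".genbank": "GenBank",
    ".gff": "GFF", ".gff3": "GFF",
    ".gtf": "GTF", ".gff2": "GTF",
    ".bed": "BED",
    ".fastq": "FASTQ", ".fq": "FASTQ",
    ".bam": "BAM",
}


def classify(filename: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (format, compression); either value is None when unrecognized."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    compression = None
    # A trailing compression suffix is peeled off before the format lookup.
    if suffixes and suffixes[-1] in COMPRESSION:
        compression = COMPRESSION[suffixes.pop()]
    file_format = FORMATS.get(suffixes[-1]) if suffixes else None
    return file_format, compression


for name in ("genome.fasta.gz", "reads.fastq.bz2", "features.gff3"):
    print(name, classify(name))
```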

Configuration Objects

GenomeConfig Object

Specifies a genome or plasmid file.

Structure:

{
  "filename": "path/to/genome.fasta"
}

Optional fields:

{
  "filename": "path/to/genome.fasta.gz",
  "validation_level": "trust",
  "n_sequence_limit": 10
}

| Field | Type | Default | Description |
|---|---|---|---|
| filename | string | (required) | Path to genome file (relative to config) |
| validation_level | string | global / "trust" | Per-file validation level override |
| threads | integer | global / auto | Per-file thread count override |
| n_sequence_limit | integer | 5 | Maximum allowed number of sequences. Applies to ref_genome_filename and mod_genome_filename only; ignored with a warning on plasmids. When the genome contains more sequences than this limit, the assembly is considered too fragmented: a warning is logged, the file is copied to data/valid/ as-is, and the pipeline will not run SyRI or ref-vs-mod comparison. Set higher for highly fragmented assemblies. |

Example:

"mod_genome_filename": {
  "filename": "data/modified.fasta",
  "n_sequence_limit": 10
}
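Before choosing a value for n_sequence_limit, it can help to count the sequences in a FASTA file. This is a rough sketch under stated assumptions: it counts header lines starting with ">", handles FASTA only (not GenBank), and the helper names are hypothetical rather than part of the package.

```python
import io


def count_fasta_records(handle) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


def is_too_fragmented(handle, n_sequence_limit: int = 5) -> bool:
    """True when the file holds more sequences than the limit allows."""
    return count_fasta_records(handle) > n_sequence_limit


# In-memory stand-in for a genome file with three sequences.
# For a .gz file, gzip.open(path, "rt") yields the same kind of text handle.
fasta = ">chromosome\nACGT\n>contig_2\nGGCC\n>contig_3\nTTAA\n"
print(count_fasta_records(io.StringIO(fasta)))
print(is_too_fragmented(io.StringIO(fasta), n_sequence_limit=2))
```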

ReadConfig Object

Specifies sequencing read files. Supports individual files or directories.

Option 1: Individual Files

"reads": [
  {
    "filename": "samples/reads_R1.fastq.gz",
    "ngs_type": "illumina"
  },
  {
    "filename": "samples/reads_R2.fastq.gz",
    "ngs_type": "illumina"
  }
]

Option 2: Directory of Files

All files in the directory inherit the same settings. Each file becomes a separate read entry. This works for all NGS types (illumina, ont, pacbio).

"reads": [
  {
    "directory": "ont_reads/",
    "ngs_type": "ont",
    "validation_level": "trust"
  }
]
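The directory form can be pictured as an expansion into one entry per file, each inheriting the directory entry's settings. The sketch below illustrates that idea only. The function name `expand_read_entry` is hypothetical, and the sorted order and the inclusion of every regular file (with no extension filtering) are assumptions, not documented behavior.

```python
import tempfile
from pathlib import Path


def expand_read_entry(entry: dict, config_dir: Path) -> list:
    """Turn a directory entry into one entry per file; pass file entries through."""
    if "directory" not in entry:
        return [entry]
    # Every setting except the directory itself is inherited by each file.
    settings = {k: v for k, v in entry.items() if k != "directory"}
    directory = config_dir / entry["directory"]
    return [
        {"filename": (Path(entry["directory"]) / p.name).as_posix(), **settings}
        for p in sorted(directory.iterdir())
        if p.is_file()
    ]


# Demo with a throwaway directory standing in for the config location.
with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp)
    (config_dir / "ont_reads").mkdir()
    for name in ("run2.fastq.gz", "run1.fastq.gz"):
        (config_dir / "ont_reads" / name).touch()
    entry = {"directory": "ont_reads/", "ngs_type": "ont", "validation_level": "trust"}
    entries = expand_read_entry(entry, config_dir)

for e in entries:
    print(e)
```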

NGS Type Values

| Value | Description | Typical Read Length |
|---|---|---|
| "illumina" | Illumina short reads (SE or PE) | 50-300 bp |
| "ont" | Oxford Nanopore long reads | 1-100+ kb |
| "pacbio" | PacBio long reads | 10-50+ kb |

FeatureConfig Object

Specifies feature annotation files (BED, GFF, GTF).

Structure:

{
  "filename": "path/to/features.gff"
}

Optional fields (validator settings):

{
  "filename": "path/to/features.gff",
  "validation_level": "strict"
}

Example:

"ref_feature_filename": {"filename": "annotations/reference.gff"}


Validator Settings in Config

You can specify validator-specific settings directly in your config file at two levels:

  1. Global options (apply to ALL files)
  2. File-level settings (apply to specific files)

These settings customize validation behavior without modifying code.

Global Options (options field)

Allowed global options:

| Option | Type | Default | Description |
|---|---|---|---|
| threads | integer / null | auto-detect | Number of threads. null or omit = auto-detect from CPU cores. System warns if value exceeds available cores. |
| validation_level | string | "trust" | "strict", "trust", or "minimal" |
| logging_level | string | "INFO" | "DEBUG", "INFO", "WARNING", or "ERROR" |
| type | string | "prokaryote" | "prokaryote" or "eukaryote". Eukaryote skips inter-genome comparison. |
| force_defragment_ref | boolean | false | Unsupported workaround; use at your own risk. Merges all reference contigs into one sequence before validation. Use only when the reference is too fragmented for any workflow. All downstream results (alignment, variant calling, feature mapping) may be incorrect or meaningless. See warning below. |

⚠ Warning: force_defragment_ref

This option is an unsupported workaround and is used entirely at your own responsibility. The EFSA pipeline does not support fragmented reference genomes. Merging contigs artificially alters the coordinate space of the reference, which means all downstream results — inter-genome alignment, variant calling, feature coordinate mapping — may be incorrect or meaningless. Do not use these results for regulatory submissions or biological conclusions without expert review. Obtain a properly assembled reference genome instead.

When force_defragment_ref is set to true in config.json, it takes priority over the --force-defragment-ref CLI flag. When set to false in config.json, the CLI flag is ignored even if provided.

Join-order TSV output

When defragmentation runs, a TSV file is written alongside the merged FASTA:

data/outputs/tables/tsv/<basename>_defragmented_join_order.tsv

| Column | Description |
|---|---|
| seq_id | Original contig ID from the source FASTA |
| length | Length of that contig in base pairs |
| start | 1-based start position of that contig in the merged sequence |

Merge order is determined by the order sequences appear in the input FASTA file — the first record becomes the first segment, the second becomes the second, and so on. No sorting or reordering is applied. The start column encodes this order unambiguously: each contig begins immediately after the previous one ends, with no gaps or separators inserted between them.

This TSV is the only record of the original contig identities and their positions in the merged sequence. Keep it if you need to map coordinates back to the source contigs.

Example:

seq_id length start
contig_1 450000 1
contig_2 320000 450001
contig_3 180000 770001
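Mapping a position in the merged sequence back to its source contig follows directly from the start and length columns. The sketch below uses the example rows above and assumes the file is tab-separated with the header line shown; the function names are hypothetical and not part of the package.

```python
import csv
import io
from bisect import bisect_right


def load_join_order(handle) -> list:
    """Read (seq_id, length, start) rows from a join-order TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["seq_id"], int(row["length"]), int(row["start"])) for row in reader]


def to_source_coordinate(segments: list, merged_pos: int) -> tuple:
    """Map a 1-based merged position to (seq_id, 1-based position in that contig)."""
    starts = [start for _, _, start in segments]
    # Index of the last contig whose start is <= merged_pos.
    index = bisect_right(starts, merged_pos) - 1
    if index < 0:
        raise ValueError(f"position {merged_pos} lies before the first contig")
    seq_id, length, start = segments[index]
    offset = merged_pos - start + 1
    if offset > length:
        raise ValueError(f"position {merged_pos} lies past the end of the merged sequence")
    return seq_id, offset


tsv = (
    "seq_id\tlength\tstart\n"
    "contig_1\t450000\t1\n"
    "contig_2\t320000\t450001\n"
    "contig_3\t180000\t770001\n"
)
segments = load_join_order(io.StringIO(tsv))
print(to_source_coordinate(segments, 450001))  # first base of contig_2
```

Because contigs are joined without gaps, each start equals one plus the sum of the preceding lengths (1 + 450000 = 450001, and 450001 + 320000 = 770001), so a binary search over the start column is enough.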

Example config:

{
  "ref_genome_filename": {"filename": "genome.fasta"},
  "reads": [
    {"filename": "reads.fastq", "ngs_type": "illumina"}
  ],
  "options": {
    "threads": 8,
    "validation_level": "trust",
    "logging_level": "DEBUG",
    "type": "prokaryote",
    "force_defragment_ref": false
  }
}

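Result: all files use the specified options unless a file-level setting overrides them. Any option not set here falls back to the corresponding CLI flag, if one was passed, and otherwise to its built-in default.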

Option priority

Options can be set from three sources. The first matching source wins:

config.json "options"  →  CLI flags (validation.sh)  →  built-in defaults

A value in config.json always takes priority over a CLI flag, and a CLI flag takes priority over the built-in default. This applies to all options including force_defragment_ref: if the config sets it to false, passing --force-defragment-ref on the command line has no effect.
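The lookup order can be sketched as a simple chain. This is an illustration of the rule as stated above, not the package's code. The function name `resolve_option` and the dictionary representation of CLI flags are assumptions; the defaults are copied from the global options table.

```python
DEFAULTS = {
    "threads": None,  # None stands for auto-detect
    "validation_level": "trust",
    "logging_level": "INFO",
    "type": "prokaryote",
    "force_defragment_ref": False,
}


def resolve_option(name: str, config_options: dict, cli_options: dict):
    """Return the first value found: config.json, then CLI flags, then defaults."""
    # Presence is tested, not truthiness, so an explicit false in the
    # config still wins over a CLI flag.
    if name in config_options:
        return config_options[name]
    if name in cli_options:
        return cli_options[name]
    return DEFAULTS[name]


cli = {"force_defragment_ref": True}  # as if --force-defragment-ref was passed
print(resolve_option("force_defragment_ref", {"force_defragment_ref": False}, cli))
print(resolve_option("force_defragment_ref", {}, cli))
print(resolve_option("logging_level", {}, cli))
```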

Validation:

  • Invalid option names → ConfigurationError (e.g., "abc" not allowed)
  • Invalid threads → ConfigurationError (e.g., negative numbers)
  • Invalid validation_level → ConfigurationError (e.g., "invalid_level")
  • Invalid logging_level → ConfigurationError (e.g., "verbose", must be DEBUG/INFO/WARNING/ERROR)
  • Invalid type → ConfigurationError (e.g., "bacteria", must be "prokaryote" or "eukaryote")
  • Invalid force_defragment_ref → ConfigurationError (must be true or false, not a string)
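The validation rules listed above could be expressed roughly as follows. This is a sketch, not the package's validator: the `ConfigurationError` class here is a local stand-in, the error messages are invented for illustration, and rejecting a thread count of 0 is an assumption (the guide only names negative numbers as invalid).

```python
class ConfigurationError(Exception):
    """Local stand-in for the package's ConfigurationError; the real class may differ."""


ALLOWED = {
    "validation_level": {"strict", "trust", "minimal"},
    "logging_level": {"DEBUG", "INFO", "WARNING", "ERROR"},
    "type": {"prokaryote", "eukaryote"},
}
KNOWN_OPTIONS = set(ALLOWED) | {"threads", "force_defragment_ref"}


def validate_options(options: dict) -> None:
    """Raise ConfigurationError on the first invalid global option."""
    for name, value in options.items():
        if name not in KNOWN_OPTIONS:
            raise ConfigurationError(f"option {name!r} is not allowed")
        if name in ALLOWED and not (isinstance(value, str) and value in ALLOWED[name]):
            raise ConfigurationError(f"invalid {name}: {value!r}")
    threads = options.get("threads")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if threads is not None and (
        isinstance(threads, bool) or not isinstance(threads, int) or threads < 1
    ):
        raise ConfigurationError(f"invalid threads: {threads!r}")
    flag = options.get("force_defragment_ref", False)
    if not isinstance(flag, bool):
        raise ConfigurationError("force_defragment_ref must be true or false")


for bad in ({"abc": 1}, {"threads": -2}, {"type": "bacteria"}, {"force_defragment_ref": "true"}):
    try:
        validate_options(bad)
    except ConfigurationError as exc:
        print(exc)
```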

File-Level Settings

Override global options for specific files by adding settings to individual file entries:

  • validation_level: "strict", "trust", or "minimal" (overrides global)
  • threads: Number of threads for compression (int, overrides global)
  • n_sequence_limit: Maximum number of sequences allowed in a genome file (int, default: 5). Applies to ref_genome_filename and mod_genome_filename only; ignored with a warning on plasmids.

Warnings: When a file-level setting overrides a global option, a WARNING is logged:

WARNING: File-level setting 'validation_level=strict' overrides global option 'validation_level=trust'
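The override behavior can be sketched as a merge of file-level settings over global options. The function name `effective_settings` and the logger name are hypothetical. Logging the warning only when the two values differ is an assumption; the package may warn whenever a file-level setting is present.

```python
import logging

logging.basicConfig(format="%(levelname)s: %(message)s")
logger = logging.getLogger("config_sketch")


def effective_settings(file_entry: dict, global_options: dict,
                       overridable=("validation_level", "threads")) -> dict:
    """Merge file-level settings over global options for one file entry."""
    settings = {}
    for name in overridable:
        if name in file_entry:
            if name in global_options and file_entry[name] != global_options[name]:
                logger.warning(
                    "File-level setting '%s=%s' overrides global option '%s=%s'",
                    name, file_entry[name], name, global_options[name],
                )
            settings[name] = file_entry[name]
        elif name in global_options:
            settings[name] = global_options[name]
    return settings


entry = {"filename": "ref.gbk", "validation_level": "strict"}
options = {"validation_level": "trust", "threads": 8}
print(effective_settings(entry, options))
```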


Examples

Minimal Configuration

Simplest valid configuration with required fields only:

{
  "ref_genome_filename": {"filename": "ref.fasta"},
  "reads": [
    {"filename": "reads.fastq", "ngs_type": "illumina"}
  ]
}

Full Configuration

Complete example with all optional fields and global options:

{
  "ref_genome_filename": {
    "filename": "ref.gbk",
    "validation_level": "strict",
    "threads": 8,
    "n_sequence_limit": 5
  },
  "mod_genome_filename": {
    "filename": "mod.fasta.gz",
    "validation_level": "strict",
    "threads": 8,
    "n_sequence_limit": 5
  },
  "ref_plasmid_filename": {
    "filename": "plasmid_ref.gbk",
    "validation_level": "strict",
    "threads": 8
  },
  "mod_plasmid_filename": {
    "filename": "plasmid_mod.fasta",
    "validation_level": "strict",
    "threads": 8
  },
  "reads": [
    {
      "filename": "illumina_R1.fastq.gz",
      "ngs_type": "illumina",
      "validation_level": "strict",
      "threads": 8
    },
    {
      "filename": "illumina_R2.fastq.gz",
      "ngs_type": "illumina",
      "validation_level": "strict",
      "threads": 8
    },
    {
      "directory": "ont_reads/",
      "ngs_type": "ont",
      "validation_level": "strict",
      "threads": 8
    }
  ],
  "ref_feature_filename": {
    "filename": "features_ref.gff3",
    "validation_level": "strict",
    "threads": 8
  },
  "options": {
    "threads": 8,
    "validation_level": "strict",
    "logging_level": "INFO",
    "type": "prokaryote"
  }
}