Supported File Formats¶

The EFSA Pipeline validation module accepts the genome, read, and feature file formats listed below and normalizes each input to a standard output format.

Genome Files¶

Supported Input Formats	Final Output Format
FASTA: `.fasta`, `.fa`, `.fna`	`.fasta`
GenBank: `.gb`, `.gbk`, `.genbank`	`.fasta`
Compression: `.gz`, `.bz2`, `.gzip`, `.bzip2`	Uncompressed

Details¶

GenBank files are automatically converted to FASTA format
Compressed files are decompressed during validation
Output is always uncompressed FASTA format
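
The extension rules in the table above can be sketched as a small lookup. This is only an illustration, not the pipeline's own code: the function name `classify_genome_file` and the extension-only detection (no content sniffing) are assumptions.

```python
from pathlib import Path

# Extension sets copied from the Genome Files table.
COMPRESSION_EXTS = {".gz", ".bz2", ".gzip", ".bzip2"}
GENOME_FORMATS = {
    ".fasta": "fasta", ".fa": "fasta", ".fna": "fasta",
    ".gb": "genbank", ".gbk": "genbank", ".genbank": "genbank",
}


def classify_genome_file(path):
    """Return (format, compression_ext_or_None) based on the file name only.

    Hypothetical helper: the real validator may inspect file contents too.
    """
    suffixes = [s.lower() for s in Path(path).suffixes]
    compression = None
    # A trailing compression extension is peeled off first.
    if suffixes and suffixes[-1] in COMPRESSION_EXTS:
        compression = suffixes.pop()
    if not suffixes or suffixes[-1] not in GENOME_FORMATS:
        raise ValueError(f"Unsupported genome file: {path}")
    return GENOME_FORMATS[suffixes[-1]], compression
```

Under these assumptions, `classify_genome_file("ref.gbk.bz2")` reports a bzip2-compressed GenBank file, which the validator would decompress and convert to `.fasta`.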

Read Files¶

Supported Input Formats	Final Output Format
FASTQ: `.fastq`, `.fq`	`.fastq`
BAM: `.bam` (limited support)	`.bam`
Compression: `.gz`, `.bz2`, `.gzip`, `.bzip2`	`.gz`

Details¶

FASTQ files remain in FASTQ format
Compressed reads stay compressed; the output is written with gzip (`.gz`) regardless of the input compression
BAM files have limited support and require special handling
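
As a rough sketch of the compression rule for reads, the snippet below rewrites any supported compressed input as gzip using only the Python standard library. The function names `open_compressed` and `recompress_reads_to_gzip` are hypothetical, and whether the pipeline streams data this way is an assumption.

```python
import bz2
import gzip
import shutil
from pathlib import Path


def open_compressed(path):
    """Open a .gz/.gzip or .bz2/.bzip2 file for binary reading."""
    ext = Path(path).suffix.lower()
    if ext in {".gz", ".gzip"}:
        return gzip.open(path, "rb")
    if ext in {".bz2", ".bzip2"}:
        return bz2.open(path, "rb")
    raise ValueError(f"Not a supported compressed file: {path}")


def recompress_reads_to_gzip(src, dst):
    """Stream the decompressed reads from src into a gzip file at dst."""
    with open_compressed(src) as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

For example, `recompress_reads_to_gzip("reads.fastq.bz2", "reads.fastq.gz")` would produce a gzip file with the same FASTQ content. BAM inputs are not covered by this sketch.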

Feature Files¶

Supported Input Formats	Final Output Format
GFF: `.gff`, `.gff3`, `.gtf`	`.gff3`
BED: `.bed`	`.gff3`
Compression: `.gz`, `.bz2`, `.gzip`, `.bzip2`	Uncompressed

Details¶

All feature formats are converted to GFF3 via gffread
BED files are automatically converted to GFF3 format
Output is always uncompressed
If gffread fails or returns 0 features, the validator falls back to direct GFF3 parsing
If both gffread and the fallback parser return 0 features, validation fails: no output file is created and `run_vcf_annotation` is set to `false` in `validated_params.json`
Coordinate validation (`start >= 1`, `start <= end`) runs in strict mode; violations are reported as warnings
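
The fallback and coordinate rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the validator's implementation: `parse_gff3_features` and `validate_features` are hypothetical names, the gffread step is represented only by the feature count it produced (`None` meaning it failed), and features with coordinate warnings are assumed to still count as features.

```python
def parse_gff3_features(lines):
    """Directly parse GFF3 lines; return (features, warnings)."""
    features, warnings = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            warnings.append(f"line {lineno}: expected 9 columns, got {len(cols)}")
            continue
        try:
            start, end = int(cols[3]), int(cols[4])
        except ValueError:
            warnings.append(f"line {lineno}: non-integer coordinates")
            continue
        # Coordinate checks from the text: start >= 1 and start <= end.
        if start < 1:
            warnings.append(f"line {lineno}: start {start} < 1")
        if start > end:
            warnings.append(f"line {lineno}: start {start} > end {end}")
        features.append((cols[0], cols[2], start, end))
    return features, warnings


def validate_features(gffread_feature_count, raw_lines):
    """Apply the fallback rule; gffread_feature_count is None if gffread failed."""
    if gffread_feature_count:
        return {"source": "gffread", "warnings": [], "params_update": {}}
    features, warnings = parse_gff3_features(raw_lines)
    if features:
        return {"source": "fallback", "warnings": warnings, "params_update": {}}
    # Both paths produced 0 features: validation fails, annotation is disabled.
    return {"source": None, "warnings": warnings,
            "params_update": {"run_vcf_annotation": False}}
```

In this sketch, `params_update` stands in for the change that would be written to `validated_params.json`; how the real module writes that file is not shown here.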

File Organization¶

Input files should be placed in:

data/inputs/

After validation, files are organized in:

data/valid/
├── assembled_genome.fasta
├── reference_genome.fasta
├── ref_plasmid.fa
├── mod_plasmid.fa
├── ref_feature.gff
├── illumina/
├── ont/
└── pacbio/
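
To check a run against the layout shown above, a short script along these lines could report which entries are present. The helper name `summarize_valid_dir` is hypothetical, and it is an assumption that not every entry is required for every run, so the sketch only reports presence rather than failing on missing items.

```python
from pathlib import Path

# Names taken from the data/valid/ tree above.
EXPECTED_FILES = [
    "assembled_genome.fasta",
    "reference_genome.fasta",
    "ref_plasmid.fa",
    "mod_plasmid.fa",
    "ref_feature.gff",
]
EXPECTED_DIRS = ["illumina", "ont", "pacbio"]


def summarize_valid_dir(root="data/valid"):
    """Return {entry_name: bool} indicating which expected entries exist."""
    root = Path(root)
    report = {name: (root / name).is_file() for name in EXPECTED_FILES}
    report.update({f"{name}/": (root / name).is_dir() for name in EXPECTED_DIRS})
    return report
```

Printing the returned mapping after validation gives a quick view of which genome, feature, and read outputs were produced.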
