Supported File Formats
The EFSA Pipeline validation module accepts genome, read, and feature files in the formats listed below and normalizes each input to the output format shown.
Genome Files
| Supported Input Formats | Final Output Format |
|---|---|
| FASTA: .fasta, .fa, .fna | .fasta |
| GenBank: .gb, .gbk, .genbank | .fasta |
| Compression: .gz, .bz2, .gzip, .bzip2 | Uncompressed |
Details
- GenBank files are automatically converted to FASTA format
- Compressed files are decompressed during validation
- Output is always uncompressed FASTA format
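
As an illustration of the extension rules in the table above, the following is a minimal sketch of how a genome file name could be classified by format and compression. The function name `classify_genome_file` and its return shape are hypothetical and are not taken from the pipeline's code; only the extension lists come from this page.

```python
from pathlib import Path

# Extension sets copied from the Genome Files table. The names below are
# hypothetical and only illustrate the matching rules.
GENOME_FORMATS = {
    "fasta": {".fasta", ".fa", ".fna"},
    "genbank": {".gb", ".gbk", ".genbank"},
}
COMPRESSION = {
    ".gz": "gzip",
    ".gzip": "gzip",
    ".bz2": "bzip2",
    ".bzip2": "bzip2",
}


def classify_genome_file(filename):
    """Return (format, compression) for a genome file name.

    compression is None for uncompressed input. Raises ValueError when the
    extension is not one of the supported genome formats.
    """
    suffixes = [s.lower() for s in Path(filename).suffixes]
    compression = None
    # A trailing compression suffix is stripped before the format is checked.
    if suffixes and suffixes[-1] in COMPRESSION:
        compression = COMPRESSION[suffixes.pop()]
    if suffixes:
        for fmt, extensions in GENOME_FORMATS.items():
            if suffixes[-1] in extensions:
                return fmt, compression
    raise ValueError(f"Unsupported genome file: {filename}")
```

Under these assumptions, `classify_genome_file("genome.gbk.gz")` would report a gzip-compressed GenBank file, which the validator would then decompress and convert to FASTA.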
Read Files
| Supported Input Formats | Final Output Format |
|---|---|
| FASTQ: .fastq, .fq | .fastq |
| BAM: .bam (limited support) | .bam |
| Compression: .gz, .bz2, .gzip, .bzip2 | .gz |
Details
- FASTQ files remain in FASTQ format
- Compressed reads are kept compressed (using gzip)
- BAM files have limited support and require special handling
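
The compression rule for reads can be sketched as follows. This is an illustration only: the function name `normalize_read_file` is hypothetical, and the handling of each case (bzip2 input recompressed to gzip, gzip input copied unchanged, uncompressed input left uncompressed) is an assumption based on the table above rather than the pipeline's actual implementation.

```python
import bz2
import gzip
import shutil
from pathlib import Path


def normalize_read_file(src, dest_dir):
    """Copy a read file into dest_dir so that compressed input ends up as .gz.

    Hypothetical helper: returns the path of the file written.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    suffix = src.suffix.lower()

    if suffix in {".bz2", ".bzip2"}:
        # Recompress bzip2 input as gzip, streaming to avoid loading the
        # whole file into memory.
        dest = dest_dir / (src.stem + ".gz")
        with bz2.open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    elif suffix == ".gzip":
        # Already gzip data; only the extension is normalized.
        dest = dest_dir / (src.stem + ".gz")
        shutil.copyfile(src, dest)
    else:
        # .gz and uncompressed files are copied unchanged.
        dest = dest_dir / src.name
        shutil.copyfile(src, dest)
    return dest
```

Streaming with `shutil.copyfileobj` is a deliberate choice here, since read files are often too large to hold in memory.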
Feature Files
| Supported Input Formats | Final Output Format |
|---|---|
| GFF: .gff, .gff3, .gtf | .gff3 |
| BED: .bed | .gff3 |
| Compression: .gz, .bz2, .gzip, .bzip2 | Uncompressed |
Details
- All feature formats are converted to GFF3 via `gffread`
- BED files are automatically converted to GFF3 format
- Output is always uncompressed
- If `gffread` fails or returns 0 features, the validator falls back to direct GFF3 parsing
- If both `gffread` and the fallback parser return 0 features, validation fails: no output file is created and `run_vcf_annotation` is set to `false` in `validated_params.json`
- Coordinate validation (`start >= 1`, `start <= end`) runs in strict mode; issues are reported as warnings
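
The coordinate checks listed above can be sketched as a small function over GFF3 lines. This is not the validator's own code: the function name `check_gff3_coordinates` and the warning wording are hypothetical. It relies only on the standard GFF3 layout, in which columns 4 and 5 of the nine tab-separated columns hold `start` and `end`, and it collects problems as warnings instead of raising.

```python
def check_gff3_coordinates(lines):
    """Return warning strings for GFF3 feature lines with bad coordinates.

    Hypothetical helper: checks start >= 1 and start <= end, and skips
    blank lines and comment/directive lines (those starting with '#').
    """
    warnings = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            warnings.append(
                f"line {line_no}: expected 9 columns, found {len(fields)}"
            )
            continue
        try:
            # GFF3 columns 4 and 5 (indices 3 and 4) are start and end.
            start, end = int(fields[3]), int(fields[4])
        except ValueError:
            warnings.append(f"line {line_no}: non-integer coordinates")
            continue
        if start < 1:
            warnings.append(f"line {line_no}: start {start} is less than 1")
        if start > end:
            warnings.append(
                f"line {line_no}: start {start} is greater than end {end}"
            )
    return warnings
```

A caller could log the returned warnings and continue, which matches the behaviour described above where coordinate issues do not by themselves fail validation.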
File Organization
Input files should be placed in:
After validation, files are organized in:
data/valid/
├── assembled_genome.fasta
├── reference_genome.fasta
├── ref_plasmid.fa
├── mod_plasmid.fa
├── ref_feature.gff
├── illumina/
├── ont/
└── pacbio/
See Also
- Configuration Guide - How to specify file paths
- Validation Overview - Validation process details