Evaluate gene annotations
InGenAnnot offers several tools to help you evaluate and improve your gene annotations. These tools are mainly based on the Annotation Edit Distance (AED). Here we propose using three InGenAnnot tools to compare several versions of gene annotations based on:
annotation statistics (gene length, number of introns, …)
AED scores and associated categories for manual curation
Sequence Ontology Classification (SO)
Workflow:
Steps:
1) Validate your annotations and output statistics
This step validates the format of your annotations and exports global statistics. You can then explore your dataset (gene length, intron size, …), compare it with other related organisms, and look for extreme values (very large UTRs, …).
# validate the gene models and get statistics
ingenannot -v 2 validate genes.gff -g genome.fasta -s
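To look for extreme values independently of the statistics reported by validate, one quick check is to list the longest gene features directly from the GFF file. The sketch below is not part of InGenAnnot: `longest_genes` is a hypothetical helper name, and it assumes a standard tab-separated GFF3 file in which gene features carry the type `gene` in column 3.

```shell
# Print the N longest gene features of a GFF3 file (default N = 10).
# Output columns: length, sequence ID, start, end, attributes.
# Assumes standard GFF3 columns: 3 = feature type, 4 = start, 5 = end, 9 = attributes.
longest_genes() {
  gff="$1"
  n="${2:-10}"
  awk -F'\t' -v OFS='\t' \
    '!/^#/ && $3 == "gene" { print $5 - $4 + 1, $1, $4, $5, $9 }' "$gff" \
    | sort -k1,1nr \
    | head -n "$n"
}
```

For example, `longest_genes genes.gff 20` would list the 20 longest genes, which can then be inspected in a genome browser.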
2) Get metrics from AED scores
You can use the aed_compare tool with a single gene set to get AED metrics, such as the geometric median or distances, and evaluate the quality of the annotations against the provided evidence.
# compare
# write file.fof
genes.aed.gff<TAB>src1
ingenannot -v 2 aed_compare file.fof -s
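The file.fof used above can be written from the shell. This is a minimal sketch, assuming the two tab-separated columns shown in the example (path to the GFF file, then a source label such as src1); the awk check is only a convenience and not an InGenAnnot feature.

```shell
# Write a one-line file of files (fof): <gff path><TAB><source label>.
# printf turns \t into a real tab character.
printf '%s\t%s\n' genes.aed.gff src1 > file.fof

# Optional check: every line must have exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { bad = 1 } END { exit bad }' file.fof \
  && echo "file.fof looks well-formed"
```

Typing the line in an editor also works, as long as the separator is a real tab and not spaces.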
3) Get manual curation categories
We defined seven categories to prioritize manual curation. The more transcripts you have in categories 1, 2 and 3, the better your gene annotations fit the provided evidence.
ingenannot -v 2 curation genes.aed.gff genes.aed.curation.gff
4) Use SO classification to detect potential problematic regions
As described in the SO classification documentation (see soclassif), SO classification detects how genes and transcripts are positioned relative to each other. You can use it to find regions with overlapping genes or problematic isoforms.