Evaluate gene annotations

InGenAnnot offers several tools to help you evaluate and improve your gene annotations. These tools are mainly based on the Annotation Evidence Distance (AED). Here we propose using three InGenAnnot tools to compare several versions of gene annotations based on:

  • annotation statistics (gene length, number of introns, …)

  • AED scores and associated categories for manual curation

  • Sequence Ontology (SO) classification

Workflow:

digraph compare {
   "Prepare / Validate data / Global statistics" -> "Annotate AED with evidence";
   "Annotate AED with evidence" -> "Compare AED with evidence";
   "Annotate AED with evidence" -> "Compare CDS / loci";
   "Annotate AED with evidence" -> "Categorize for manual curation";
   "Annotate AED with evidence" -> "Explore results of SO Classification";
}

Steps:

1) Validate your annotations and output statistics

This step validates the format of your annotations and exports global statistics. You can then explore your dataset (gene length, intron size, …), compare it with related organisms, and look for extreme values (very large UTRs, …).

# validate the gene models and get statistics
ingenannot -v 2 validate genes.gff -g genome.fasta -s
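
To illustrate the kind of statistic worth exploring, here is a minimal Python sketch that summarizes gene lengths from GFF3 lines. It is not InGenAnnot's implementation: the function names gene_lengths and summarize and the example records are hypothetical, and the sketch assumes standard 9-column GFF3 with "gene" in the feature-type column.

```python
import statistics


def gene_lengths(gff_lines):
    """Return the length of every 'gene' feature found in GFF3 lines."""
    lengths = []
    for line in gff_lines:
        # Skip blank lines, directives and comments
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        start, end = int(cols[3]), int(cols[4])
        # GFF3 coordinates are 1-based and inclusive
        lengths.append(end - start + 1)
    return lengths


def summarize(lengths):
    """Return simple summary statistics for a non-empty list of lengths."""
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# Hypothetical records, for illustration only
example = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t1099\t.\t+\t.\tID=g1",
    "chr1\tsrc\tmRNA\t100\t1099\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tsrc\tgene\t5000\t5299\t.\t-\t.\tID=g2",
    "chr1\tsrc\tgene\t9000\t20999\t.\t+\t.\tID=g3",
]
print(summarize(gene_lengths(example)))
```

A maximum far above the median, as in this toy example, is the sort of extreme value that deserves a closer look.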

2) Get metrics from AED scores

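You can run the aed_compare tool on a single gene set to obtain AED metrics, such as the geometric median or distances, which evaluate how well the annotations agree with the provided evidence. The input is a gene set already annotated with AED scores (the "Annotate AED with evidence" step of the workflow).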

# write file.fof (<TAB> stands for a tab character)
genes.aed.gff<TAB>src1

# compare
ingenannot -v 2 aed_compare file.fof -s
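
To make the scores behind these metrics more concrete, the following Python sketch computes a base-level AED between one predicted transcript and one piece of evidence, assuming the commonly used definition AED = 1 - (SN + SP) / 2. This is an assumption for illustration: InGenAnnot's own computation may differ in detail, and the function names and coordinates are hypothetical.

```python
def covered_bases(intervals):
    """Return the set of positions covered by 1-based inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(prediction, evidence):
    """Base-level AED: 0.0 means perfect agreement, 1.0 means no overlap at all."""
    pred = covered_bases(prediction)
    evid = covered_bases(evidence)
    if not pred or not evid:
        return 1.0
    shared = len(pred & evid)
    sensitivity = shared / len(evid)  # fraction of the evidence covered by the prediction
    specificity = shared / len(pred)  # fraction of the prediction supported by the evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Hypothetical exon coordinates: the second predicted exon extends past the evidence
prediction = [(100, 199), (300, 399)]
evidence = [(100, 199), (300, 349)]
print(aed(prediction, evidence))
```

Using sets of positions keeps the sketch short; for large gene sets an interval-based computation would be more efficient.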

3) Get manual curation categories

We defined seven categories to prioritize manual curation. The more transcripts fall into categories 1, 2 and 3, the better your gene annotations fit the provided evidence.

ingenannot -v 2 curation genes.aed.gff genes.aed.curation.gff

4) Use SO classification to detect potential problematic regions

As described in the SO classification documentation (see soclassif), SO classification describes how genes and transcripts are positioned relative to each other. It lets you detect regions with overlapping genes or problematic isoforms.
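
As a much simpler illustration of the idea, the Python sketch below reports pairs of genes whose coordinates overlap on the same sequence. It is not the SO classification itself: it ignores strand and transcript structure, and the function name overlapping_genes and the example coordinates are hypothetical.

```python
def overlapping_genes(genes):
    """Return (id_a, id_b) pairs of genes overlapping on the same sequence.

    genes: iterable of (seqid, start, end, gene_id), 1-based inclusive coordinates.
    """
    pairs = []
    # Sort by sequence, then start, so the inner loop can stop early
    ordered = sorted(genes, key=lambda g: (g[0], g[1], g[2]))
    for i, (seqid, start, end, gene_id) in enumerate(ordered):
        for other_seqid, other_start, other_end, other_id in ordered[i + 1:]:
            # Every later gene starts at or after other_start, so no further overlap is possible
            if other_seqid != seqid or other_start > end:
                break
            pairs.append((gene_id, other_id))
    return pairs


# Hypothetical gene coordinates, for illustration only
genes = [
    ("chr1", 100, 500, "g1"),
    ("chr1", 400, 900, "g2"),
    ("chr1", 1000, 1500, "g3"),
    ("chr2", 100, 500, "g4"),
]
print(overlapping_genes(genes))
```

Regions reported by such a check are natural candidates to inspect alongside the SO classification results.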