Select best gene models from predictions
The primary goal of InGenAnnot is to assist in gene selection and curation when using multiple gene predictors. Various tools are available for predicting gene structures with varying sensitivity and specificity. Similar to EvidenceModeler [1], InGenAnnot offers a tool to select the best gene model that fits transcriptomic or protein evidence. The selection is based on the Annotation Edit Distance (AED), as described by the authors of MAKER [2]. Unsupported splicing junctions can be penalized to maximize alignment with the evidence. If you require only evidence-supported gene models, you can set AED thresholds. Additionally, if you wish to rescue fully ab initio gene models, you can define a minimal number of sources to retain a gene at a locus without evidence.
Workflow:

Steps:
1) Generate / Assemble new transcripts from RNA-Seq / long reads
Exemple: assembly of new transcripts with StringTie [3] from paired-oriented reads mapped with STAR [5]:
stringtie Aligned.sortedByCoord.out.bam -l 11DPI -o transcripts.gff -m 1 50 --rf -p 12 -g 0 -f 0.1 -j 4 -a 10
If you have long-read sequencing such Iso-Seq PacBio data, prepare your transcripts in a gtf format using available pipeline such CupCake [6]. You can filtered out isoforms with a limited support based on RNA-seq data with isoform_ranking and keep the top isoform or the alternatives gff file (see isoform_ranking).
# write your file of files: bam.fof as below, such as "path to bam<tab>paired<tab>stranded"
/tmp/run1.singleton.stranded.bam False True
/tmp/run2.singleton.unstranded.bam False False
/tmp/run3.paired.stranded.bam True True
/tmp/run4.paired.unstranded.bam true False
/tmp/Aligned.sortedByCoord.out.bam true true
# your longreads.gff file must look like a gtf file, with several trancript for the same gene, as below:
chr_1 ingenannot-isoform-ranking transcript 140523 145168 . - . gene_id "PB.12"; transcript_id "PB.12.1";
chr_1 ingenannot-isoform-ranking exon 140523 145168 . - . gene_id "PB.12"; transcript_id "PB.12.1";
chr_1 ingenannot-isoform-ranking transcript 140531 145121 . - . gene_id "PB.12"; transcript_id "PB.12.2";
chr_1 ingenannot-isoform-ranking exon 140531 145121 . - . gene_id "PB.12"; transcript_id "PB.12.2";
chr_1 ingenannot-isoform-ranking transcript 140579 144883 . - . gene_id "PB.12"; transcript_id "PB.12.3";
chr_1 ingenannot-isoform-ranking exon 140579 144883 . - . gene_id "PB.12"; transcript_id "PB.12.3";
chr_1 ingenannot-isoform-ranking transcript 140708 144901 . - . gene_id "PB.12"; transcript_id "PB.12.4";
chr_1 ingenannot-isoform-ranking exon 140708 144901 . - . gene_id "PB.12"; transcript_id "PB.12.4";
chr_1 ingenannot-isoform-ranking transcript 202365 205520 . - . gene_id "PB.22"; transcript_id "PB.22.1";
chr_1 ingenannot-isoform-ranking exon 202365 205031 . - . gene_id "PB.22"; transcript_id "PB.22.1";
chr_1 ingenannot-isoform-ranking exon 205203 205520 . - . gene_id "PB.22"; transcript_id "PB.22.1";
chr_1 ingenannot-isoform-ranking transcript 202939 205422 . - . gene_id "PB.22"; transcript_id "PB.22.2";
chr_1 ingenannot-isoform-ranking exon 202939 205422 . - . gene_id "PB.22"; transcript_id "PB.22.2";
chr_1 ingenannot-isoform-ranking transcript 202939 205488 . - . gene_id "PB.22"; transcript_id "PB.22.3";
chr_1 ingenannot-isoform-ranking exon 202939 205031 . - . gene_id "PB.22"; transcript_id "PB.22.3";
chr_1 ingenannot-isoform-ranking exon 205203 205488 . - . gene_id "PB.22"; transcript_id "PB.22.3";
# isofom ranking with multiple bam files
ingenannot -v 2 isoform_ranking longreads.gff -f bam.fof --alt_threshold 0.1
2) Generate protein matches
Exemple: alignment of known proteins onto your genome with exonerate [3]
# run exonerate with specific parameters
exonerate --model protein2genome --showvulgar no --showalignment no --showtargetgff yes --showquerygff no --minintron 4 --maxintron 5000 --percent 50 --ryo "AveragePercentIdentity: %pi\n" myproteins.fasta genome.fasta > exonerate.out
# export in gff format
ingenannot -v 2 exonerate_to_gff exonerate.out > proteins.gff
If you want to speed up protein alignment you can use miniprot [4] and ask an export in gff file such as:
# run minprot with specific parameters
miniprot -t 8 -G 5000 --gff genome.fasta myproteins.fasta > proteins.gff
You will be able to directly use this gff file in select and aed commands with specific parameter ‘–evpr_source miniprot’. Note: Do not use the ‘validate’ tool for “proteins.gff” file generated with miniprot. Some tags as ID are missing and will be added by the program.
Note: you can use other tools like Diamond, ProtHint. You just have to provide a gff file in alignment format such as:
chr_1 exonerate_to_gff match 2 100 100.00 + . ID=match.95357;Dbxref=exonerate:0;Name=tr|A0A4Q4NES4|A0A4Q4NES4_9PLEO
chr_1 exonerate_to_gff match_part 2 100 100.00 + . ID=match.95357.0;Parent=match.95357;Dbxref=exonerate:tr|A0A4Q4NES4|A0A4Q4NES4_9PLEO;Target=tr|A0A4Q4NES4|A0A4Q4NES4_9PLEO 11 44
chr_1 exonerate_to_gff match 176525 177163 71.58 + . ID=match.117446;Dbxref=exonerate:1;Name=tr|M2UAA1|M2UAA1_COCH5
chr_1 exonerate_to_gff match_part 176525 176963 71.58 + . ID=match.117446.0;Parent=match.117446;Dbxref=exonerate:tr|M2UAA1|M2UAA1_COCH5;Target=tr|M2UAA1|M2UAA1_COCH5 33 101
chr_1 exonerate_to_gff match_part 177030 177163 71.58 + . ID=match.117446.1;Parent=match.117446;Dbxref=exonerate:tr|M2UAA1|M2UAA1_COCH5;Target=tr|M2UAA1|M2UAA1_COCH5 103 181
3) Prepare your data and validate them
# validate the gene models for each predictor / source
ingenannot -v 2 validate src1.gff
ingenannot -v 2 validate src2.gff
ingenannot -v 2 validate src3.gff
# validate the evidence files
ingenannot -v 2 validate transcripts.gff
ingenannot -v 2 validate proteins.gff
# longreads.gff could be your raw file or top/alternatives from isoform_ranking
# we recommend to use the top and keep the alternatives for possible isoforms
ingenannot -v 2 validate longreads.gff
# prepare the data
sort -k1,1 -k4g,4 transcripts.gff > transcripts.sorted.gff
bgzip transcripts.sorted.gff
tabix -p gff transcripts.sorted.gff.gz
sort -k1,1 -k4g,4 longreads.gff > longreads.sorted.gff
bgzip longreads.sorted.gff
tabix -p gff longreads.sorted.gff.gz
sort -k1,1 -k4g,4 proteins.gff > proteins.sorted.gff
bgzip proteins.sorted.gff
tabix -p gff proteins.sorted.gff.gz
3.b) Filter TEs genes
If your gene predictions do not care about TEs, filter them before AED and gene selection.
# filter gene models of each predictor / source overlapping TEs annotations
ingenannot -v 2 filter src1.gff TEs.gff src1.noTE.gff3 -f match_part
ingenannot -v 2 filter src2.gff TEs.gff src2.noTE.gff3 -f match_part
ingenannot -v 2 filter src3.gff TEs.gff src3.noTE.gff3 -f match_part
4) Run aed on each gene dataset
# run aed on each source, use specific parameters if necessary, but keep the same for all sources
ingenannot -v 2 -p 10 aed src1.gff src1.aed.gff src1 transcripts.sorted.gff.gz proteins.sorted.gff.gz --longreads longreads.sorted.gff.gz --evtrstranded --longreads_source "PacBio" --penalty_overflow 0.2 --aed_tr_cds_only
ingenannot -v 2 -p 10 aed src2.gff src2.aed.gff src2 transcripts.sorted.gff.gz proteins.sorted.gff.gz --longreads longreads.sorted.gff.gz --evtrstranded --longreads_source "PacBio" --penalty_overflow 0.2 --aed_tr_cds_only
ingenannot -v 2 -p 10 aed src3.gff src3.aed.gff src3 transcripts.sorted.gff.gz proteins.sorted.gff.gz --longreads longreads.sorted.gff.gz --evtrstranded --longreads_source "PacBio" --penalty_overflow 0.2 --aed_tr_cds_only
5) Run select on annotated gene dataset
# write genes.fof
src1.aed.gff<TAB>src1
src2.aed.gff<TAB>src2
src3.aed.gff<TAB>src3
# run select with specific parameters, see doc
ingenannot -v 2 -p 10 select genes.fof select.genes.gff --noaedtr --clustranded --nbsrc_filter 2 --aedtr 0.4 --aedpr 0.2 --use_ev_lg--use_ev_lg --min_cds_len 100 --no_partial --genome genome.fasta --no_cds_overlap
6) Compare the selection with original datasets
# write compare.fof
src1.aed.gff<TAB>src1
src2.aed.gff<TAB>src2
src3.aed.gff<TAB>src3
select.genes.gff<TAB>select
# run compare
ingenannot -v 2 -p 10 compare compare.fof
For a more complete comparison, you can follow the use case Compare / Evaluate different annotation datasets.
Then you can perform extra-steps:
Add non-predicted Small Seceted Proteins: If you are working on a fungal genome with potential SSP, you can try to add unpredicted SSP (see rescue_effectors).
Add UTRs: You can now add UTRs to your gene models following this protocole: add_UTRs.
Add Isoforms: Add isoforms to your annotation with add new isoforms.