Select best gene models from predictions

The primary goal of InGenAnnot is to assist in gene selection and curation when using multiple gene predictors. Various tools are available for predicting gene structures with varying sensitivity and specificity. Similar to EvidenceModeler [1], InGenAnnot offers a tool to select the best gene model that fits transcriptomic or protein evidence. The selection is based on the Annotation Edit Distance (AED), as described by the authors of MAKER [2]. Unsupported splicing junctions can be penalized to maximize alignment with the evidence. If you require only evidence-supported gene models, you can set AED thresholds. Additionally, if you wish to rescue fully ab initio gene models, you can define a minimal number of sources to retain a gene at a locus without evidence.

Workflow:

digraph SSP {
   "Assemble transcripts" -> "Prepare / Validate data";
   "Select long-read isoforms" -> "Prepare / Validate data";
   "Align proteins" -> "Prepare / Validate data";
   "Prepare / Validate data" -> "AED annotation";
   "Prepare / Validate data" -> "Filter TEs genes";
   "Filter TEs genes" -> "AED annotation";
   "AED annotation" -> "Select";
   "Select" -> "Compare the selection with all sources";
}

Steps:

1) Generate / Assemble new transcripts from RNA-Seq / long reads

Exemple: assembly of new transcripts with StringTie [3] from paired-oriented reads mapped with STAR [5]:

stringtie Aligned.sortedByCoord.out.bam -l 11DPI -o transcripts.gff -m 1 50 --rf -p 12 -g 0 -f 0.1 -j 4 -a 10

If you have long-read sequencing such Iso-Seq PacBio data, prepare your transcripts in a gtf format using available pipeline such CupCake [6]. You can filtered out isoforms with a limited support based on RNA-seq data with isoform_ranking and keep the top isoform or the alternatives gff file (see isoform_ranking).

# write your file of files: bam.fof as below, such as "path to bam<tab>paired<tab>stranded"
/tmp/run1.singleton.stranded.bam    False   True
/tmp/run2.singleton.unstranded.bam  False   False
/tmp/run3.paired.stranded.bam  True  True
/tmp/run4.paired.unstranded.bam  true   False
/tmp/Aligned.sortedByCoord.out.bam true true


# your longreads.gff file must look like a gtf file, with several trancript for the same gene, as below:
chr_1       ingenannot-isoform-ranking      transcript      140523  145168  .       -       .       gene_id "PB.12"; transcript_id "PB.12.1";
chr_1       ingenannot-isoform-ranking      exon    140523  145168  .       -       .       gene_id "PB.12"; transcript_id "PB.12.1";
chr_1       ingenannot-isoform-ranking      transcript      140531  145121  .       -       .       gene_id "PB.12"; transcript_id "PB.12.2";
chr_1       ingenannot-isoform-ranking      exon    140531  145121  .       -       .       gene_id "PB.12"; transcript_id "PB.12.2";
chr_1       ingenannot-isoform-ranking      transcript      140579  144883  .       -       .       gene_id "PB.12"; transcript_id "PB.12.3";
chr_1       ingenannot-isoform-ranking      exon    140579  144883  .       -       .       gene_id "PB.12"; transcript_id "PB.12.3";
chr_1       ingenannot-isoform-ranking      transcript      140708  144901  .       -       .       gene_id "PB.12"; transcript_id "PB.12.4";
chr_1       ingenannot-isoform-ranking      exon    140708  144901  .       -       .       gene_id "PB.12"; transcript_id "PB.12.4";
chr_1       ingenannot-isoform-ranking      transcript      202365  205520  .       -       .       gene_id "PB.22"; transcript_id "PB.22.1";
chr_1       ingenannot-isoform-ranking      exon    202365  205031  .       -       .       gene_id "PB.22"; transcript_id "PB.22.1";
chr_1       ingenannot-isoform-ranking      exon    205203  205520  .       -       .       gene_id "PB.22"; transcript_id "PB.22.1";
chr_1       ingenannot-isoform-ranking      transcript      202939  205422  .       -       .       gene_id "PB.22"; transcript_id "PB.22.2";
chr_1       ingenannot-isoform-ranking      exon    202939  205422  .       -       .       gene_id "PB.22"; transcript_id "PB.22.2";
chr_1       ingenannot-isoform-ranking      transcript      202939  205488  .       -       .       gene_id "PB.22"; transcript_id "PB.22.3";
chr_1       ingenannot-isoform-ranking      exon    202939  205031  .       -       .       gene_id "PB.22"; transcript_id "PB.22.3";
chr_1       ingenannot-isoform-ranking      exon    205203  205488  .       -       .       gene_id "PB.22"; transcript_id "PB.22.3";


# isofom ranking with multiple bam files
ingenannot -v 2 isoform_ranking longreads.gff -f bam.fof --alt_threshold 0.1

2) Generate protein matches

Exemple: alignment of known proteins onto your genome with exonerate [3]

# run exonerate with specific parameters
exonerate --model protein2genome --showvulgar no --showalignment no --showtargetgff yes --showquerygff no --minintron 4 --maxintron 5000 --percent 50 --ryo "AveragePercentIdentity: %pi\n"  myproteins.fasta genome.fasta > exonerate.out

# export in gff format
ingenannot -v 2 exonerate_to_gff exonerate.out > proteins.gff

If you want to speed up protein alignment you can use miniprot [4] and ask an export in gff file such as:

# run minprot with specific parameters
miniprot -t 8 -G 5000 --gff genome.fasta myproteins.fasta > proteins.gff

You will be able to directly use this gff file in select and aed commands with specific parameter ‘–evpr_source miniprot’. Note: Do not use the ‘validate’ tool for “proteins.gff” file generated with miniprot. Some tags as ID are missing and will be added by the program.

Note: you can use other tools like Diamond, ProtHint. You just have to provide a gff file in alignment format such as:

chr_1   exonerate_to_gff        match   2       100     100.00  +       .       ID=match.95357;Dbxref=exonerate:0;Name=tr|A0A4Q4NES4|A0A4Q4NES4_9PLEO
chr_1   exonerate_to_gff        match_part      2       100     100.00  +       .       ID=match.95357.0;Parent=match.95357;Dbxref=exonerate:tr|A0A4Q4NES4|A0A4Q4NES4_9PLEO;Target=tr|A0A4Q4NES4|A0A4Q4NES4_9PLEO 11 44
chr_1   exonerate_to_gff        match   176525  177163  71.58   +       .       ID=match.117446;Dbxref=exonerate:1;Name=tr|M2UAA1|M2UAA1_COCH5
chr_1   exonerate_to_gff        match_part      176525  176963  71.58   +       .       ID=match.117446.0;Parent=match.117446;Dbxref=exonerate:tr|M2UAA1|M2UAA1_COCH5;Target=tr|M2UAA1|M2UAA1_COCH5 33 101
chr_1   exonerate_to_gff        match_part      177030  177163  71.58   +       .       ID=match.117446.1;Parent=match.117446;Dbxref=exonerate:tr|M2UAA1|M2UAA1_COCH5;Target=tr|M2UAA1|M2UAA1_COCH5 103 181

3) Prepare your data and validate them

# validate the gene models for each predictor / source
ingenannot -v 2 validate src1.gff
ingenannot -v 2 validate src2.gff
ingenannot -v 2 validate src3.gff

# validate the evidence files
ingenannot -v 2 validate transcripts.gff
ingenannot -v 2 validate proteins.gff

# longreads.gff could be your raw file or top/alternatives from isoform_ranking
# we recommend to use the top and keep the alternatives for possible isoforms
ingenannot -v 2 validate longreads.gff

# prepare the data
sort -k1,1 -k4g,4 transcripts.gff > transcripts.sorted.gff
bgzip transcripts.sorted.gff
tabix -p gff transcripts.sorted.gff.gz
sort -k1,1 -k4g,4 longreads.gff > longreads.sorted.gff
bgzip longreads.sorted.gff
tabix -p gff longreads.sorted.gff.gz
sort -k1,1 -k4g,4 proteins.gff > proteins.sorted.gff
bgzip proteins.sorted.gff
tabix -p gff proteins.sorted.gff.gz

3.b) Filter TEs genes

If your gene predictions do not care about TEs, filter them before AED and gene selection.

# filter gene models of each predictor / source overlapping TEs annotations
ingenannot -v 2 filter src1.gff TEs.gff src1.noTE.gff3 -f match_part
ingenannot -v 2 filter src2.gff TEs.gff src2.noTE.gff3 -f match_part
ingenannot -v 2 filter src3.gff TEs.gff src3.noTE.gff3 -f match_part

4) Run aed on each gene dataset

# run aed on each source, use specific parameters if necessary, but keep the same for all sources
ingenannot -v 2 -p 10 aed src1.gff src1.aed.gff src1 transcripts.sorted.gff.gz proteins.sorted.gff.gz --longreads longreads.sorted.gff.gz --evtrstranded --longreads_source "PacBio" --penalty_overflow 0.2 --aed_tr_cds_only
ingenannot -v 2 -p 10 aed src2.gff src2.aed.gff src2 transcripts.sorted.gff.gz proteins.sorted.gff.gz --longreads longreads.sorted.gff.gz --evtrstranded --longreads_source "PacBio" --penalty_overflow 0.2 --aed_tr_cds_only
ingenannot -v 2 -p 10 aed src3.gff src3.aed.gff src3 transcripts.sorted.gff.gz proteins.sorted.gff.gz --longreads longreads.sorted.gff.gz --evtrstranded --longreads_source "PacBio" --penalty_overflow 0.2 --aed_tr_cds_only

5) Run select on annotated gene dataset

# write genes.fof
src1.aed.gff<TAB>src1
src2.aed.gff<TAB>src2
src3.aed.gff<TAB>src3

# run select with specific parameters, see doc
ingenannot -v 2 -p 10 select genes.fof select.genes.gff --noaedtr --clustranded --nbsrc_filter 2 --aedtr 0.4 --aedpr 0.2 --use_ev_lg--use_ev_lg --min_cds_len 100 --no_partial --genome genome.fasta --no_cds_overlap

6) Compare the selection with original datasets

# write compare.fof
src1.aed.gff<TAB>src1
src2.aed.gff<TAB>src2
src3.aed.gff<TAB>src3
select.genes.gff<TAB>select

# run compare
ingenannot -v 2 -p 10 compare compare.fof

For a more complete comparison, you can follow the use case Compare / Evaluate different annotation datasets.

Then you can perform extra-steps:

  • Add non-predicted Small Seceted Proteins: If you are working on a fungal genome with potential SSP, you can try to add unpredicted SSP (see rescue_effectors).

  • Add UTRs: You can now add UTRs to your gene models following this protocole: add_UTRs.

  • Add Isoforms: Add isoforms to your annotation with add new isoforms.

References