aed

The aed command annotates gene models with Annotation Edit Distance (AED) scores computed from transcript (RNA-Seq, long-read) and protein evidence.

usage

$ ingenannot -v 2 aed genes.gff genes.aed.gff predictor rnaseq.gff.gz protein.gff.gz --evtrstranded --evprstranded

positional arguments:

Input

GFF/GTF File

Output

Output annotation file in GFF format, with AED scores

source

Source of Annotation (eugene, maker, braker3, helixer…)

evtr

GFF file of transcript evidence, compressed with bgzip and indexed with tabix

evpr

GFF file of protein evidence, compressed with bgzip and indexed with tabix

optional arguments:

-h, --help

show this help message and exit

--evtr_source EVTR_SOURCE

Source of the transcript evidence in the GFF file, e.g. “stringtie”, default=undefined

--evpr_source EVPR_SOURCE

Source of the protein evidence in the GFF file, e.g. “blastx, diamond, exonerate, miniprot”, default=undefined

--evtrstranded

Require the same strand orientation to consider a match with transcript evidence, default=False

--evprstranded

Require the same strand orientation to consider a match with protein evidence, default=False

--penalty_overflow PENALTY_OVERFLOW

If a Coding DNA Sequence (CDS) exceeds the expected length or violates intron constraints based on transcript evidence, a penalty is applied to the computation of the Annotation Edit Distance (AED) score. The penalty value ranges from 0.0 to 1.0, default=0.0 (no penalty)

--longreads LONGREADS

GFF file of long-read based transcript evidence, compressed with bgzip and indexed with tabix

--longreads_source LONGREADS_SOURCE

Source of the long-read based evidence in the GFF file, e.g. “Iso-Seq”, default=undefined

--longreads_penalty_overflow LONGREADS_PENALTY_OVERFLOW

If a Coding DNA Sequence (CDS) exceeds the expected length or violates intron constraints based on long-read transcript evidence, a penalty is applied to the computation of the Annotation Edit Distance (AED) score. The penalty value ranges from 0.0 to 1.0, default=0.25

--aedtr AEDTR

Transcript AED value for graph limits, default=0.5

--aedpr AEDPR

Protein AED value for graph limits, default=0.2

--aed_tr_cds_only

For transcripts (short-reads and longreads), compute AED on CDS only, instead of Exon and CDS, with best score selection, default=False
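As a point of reference, the AED is usually defined, following MAKER, as 1 - (SN + SP) / 2, where SN is the fraction of evidence bases overlapped by the gene model and SP is the fraction of gene model bases overlapped by the evidence. The sketch below illustrates that definition on plain interval lists. It is only an illustration under these assumptions, not ingenannot's implementation: the function names are hypothetical, strand is ignored, and the penalty options are not modelled.

```python
def merge(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def covered(intervals):
    """Number of distinct bases covered by the intervals."""
    return sum(end - start + 1 for start, end in merge(intervals))


def overlap(a, b):
    """Number of bases shared by two interval lists."""
    total = 0
    for s1, e1 in merge(a):
        for s2, e2 in merge(b):
            lo, hi = max(s1, s2), min(e1, e2)
            if lo <= hi:
                total += hi - lo + 1
    return total


def aed(model, evidence):
    """AED = 1 - (SN + SP) / 2; hypothetical helper, no strand, no penalty."""
    shared = overlap(model, evidence)
    if shared == 0:
        return 1.0  # no shared base: worst possible score
    sensitivity = shared / covered(evidence)  # evidence bases found in the model
    specificity = shared / covered(model)     # model bases supported by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


print(aed([(100, 199)], [(100, 199)]))  # 0.0, perfect agreement
print(aed([(1, 100)], [(1, 50)]))       # 0.25, model twice as long as evidence
```

A score of 0 means the model and the evidence cover exactly the same bases; a score of 1 means they share none.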

inputs

Gene models in GFF/GTF format. Evidence sources in GFF format, compressed with bgzip and indexed with tabix, for example:

# example of transcripts
chr_1   StringTie       exon    229833  230125  1000    +       .       gene_id "SRR8788921.34"; transcript_id "SRR8788921.24.1_1-1693"; exon_number "1"; cov "34.847462";
chr_1   StringTie       transcript      229833  230125  1000    +       .       gene_id "SRR8788921.34"; transcript_id "SRR8788921.24.1_1-1693"; cov "34.847462"; FPKM "13.868441"; TPM "17.033602";
chr_1   StringTie       exon    1054358 1057759 1000    +       .       gene_id "SCA3419A32.234"; transcript_id "SCA3419A32.219.1"; exon_number "1"; cov "89.961784";
chr_1   StringTie       transcript      1054358 1057759 1000    +       .       gene_id "SCA3419A32.234"; transcript_id "SCA3419A32.219.1"; cov "89.961784"; FPKM "8.020829"; TPM "11.758745";
chr_1   StringTie       exon    5310080 5314987 1000    -       .       gene_id "SRR6215483.1607"; transcript_id "SRR6215483.1600.1"; exon_number "1"; cov "13.255298";

# sort / index evidence
sort -k1,1 -k4,4g transcript.gff > transcript.sort.gff
bgzip transcript.sort.gff
tabix -p gff transcript.sort.gff.gz
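Before compressing and indexing, it can help to confirm that the evidence file is ordered the way tabix expects: records grouped by sequence name and, within each sequence, sorted by start coordinate. The following is a minimal sketch of such a check. The function name is hypothetical, and the sample records are shortened versions of the transcript example with the attribute column replaced by ".".

```python
def check_tabix_order(lines):
    """Return None if records are ordered for tabix, else a message
    describing the first problem found (hypothetical helper)."""
    seen = set()
    current = None
    last_start = 0
    for number, line in enumerate(lines, start=1):
        # Skip blank lines and comment/header lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        seqid, start = fields[0], int(fields[3])
        if seqid != current:
            # A sequence name must form one contiguous block
            if seqid in seen:
                return f"line {number}: {seqid} appears in more than one block"
            seen.add(seqid)
            current, last_start = seqid, start
        elif start < last_start:
            return f"line {number}: start {start} < previous start {last_start}"
        else:
            last_start = start
    return None


records = [
    "chr_1\tStringTie\texon\t229833\t230125\t1000\t+\t.\t.",
    "chr_1\tStringTie\texon\t1054358\t1057759\t1000\t+\t.\t.",
    "chr_1\tStringTie\texon\t5310080\t5314987\t1000\t-\t.\t.",
]
print(check_tabix_order(records))  # None
```

To check a real file, pass an open file handle instead of the list, for example `check_tabix_order(open("transcript.sort.gff"))`.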

outputs

The outputs are the GFF file annotated with AED scores for each source of evidence, reporting the best match (score and evidence name), and a graphical representation of the AED scores for all genes (AED plot).

# input gff:
chr_1   ingenannot      gene    5318821 5320515 .       -       .       ID=gene:curtin.2067;
chr_1   ingenannot      mRNA    5318821 5320515 .       -       .       ID=mRNA:curtin.2067.1;source=CURTIN;Parent=gene:curtin.2067;
chr_1   ingenannot      exon    5318821 5320515 .       -       .       ID=exon:mRNA:curtin.2067.1.1;Parent=mRNA:curtin.2067.1;
chr_1   ingenannot      CDS     5318821 5320515 .       -       0       ID=cds:mRNA:curtin.2067.1;Parent=mRNA:curtin.2067.1;

# output gff:
chr_1   ingenannot      gene    5318821 5320515 .       -       .       ID=gene:curtin.2067;source=CURTIN;
chr_1   ingenannot      mRNA    5318821 5320515 .       -       .       ID=mRNA:curtin.2067.1;source=CURTIN;Parent=gene:curtin.2067;ev_tr=SRR8788924.1287.1;aed_ev_tr=0.0366;ev_tr_penalty=undef;ev_pr=match.75763;aed_ev_pr=0.0478;ev_lg=PB.1516.1;aed_ev_lg=0.0468;ev_lg_penalty=no;
chr_1   ingenannot      exon    5318821 5320515 .       -       .       ID=exon:mRNA:curtin.2067.1.1;source=CURTIN;Parent=mRNA:curtin.2067.1;
chr_1   ingenannot      CDS     5318821 5320515 .       -       0       ID=cds:mRNA:curtin.2067.1;source=CURTIN;Parent=mRNA:curtin.2067.1;
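The AED attributes written on the mRNA line can be read back with a few lines of code. The sketch below parses the attribute column of the mRNA record shown above and extracts the three scores. The helper names are hypothetical, and using the default --aedtr and --aedpr values (0.5 and 0.2) as a filter is an assumption made for illustration; the documentation only describes them as graph limits.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column into a key/value dictionary."""
    attributes = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def aed_scores(attributes):
    """Return the AED attributes as floats; missing or non-numeric values become None."""
    scores = {}
    for key in ("aed_ev_tr", "aed_ev_pr", "aed_ev_lg"):
        try:
            scores[key] = float(attributes.get(key))
        except (TypeError, ValueError):
            scores[key] = None
    return scores


# Attribute column of the mRNA line from the output example
column9 = (
    "ID=mRNA:curtin.2067.1;source=CURTIN;Parent=gene:curtin.2067;"
    "ev_tr=SRR8788924.1287.1;aed_ev_tr=0.0366;ev_tr_penalty=undef;"
    "ev_pr=match.75763;aed_ev_pr=0.0478;"
    "ev_lg=PB.1516.1;aed_ev_lg=0.0468;ev_lg_penalty=no;"
)

attributes = parse_gff3_attributes(column9)
scores = aed_scores(attributes)
print(scores)
# {'aed_ev_tr': 0.0366, 'aed_ev_pr': 0.0478, 'aed_ev_lg': 0.0468}

# Assumed filter: keep models within the default graph limits
within_limits = scores["aed_ev_tr"] <= 0.5 and scores["aed_ev_pr"] <= 0.2
print(within_limits)  # True
```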

Output AED plot (corresponding plot of the output gff file above):

AED plot

An AED plot from a real transcript dataset:

AED plot

The white text boxes report all transcripts in the area between the red dashed lines. The blue text boxes report all transcripts without a penalty (a penalty is applied when intron constraints are violated). The middle panel is a scatter plot of the AED scores. Along each axis, a histogram shows the transcript and protein AED scores.