aed
The aed command annotates gene models with Annotation Edit Distance (AED) scores computed against transcript (RNA-Seq, long reads) and protein evidence.
usage
$ ingenannot -v 2 aed genes.gff genes.aed.gff predictor rnaseq.gff.gz protein.gff.gz --evtrstranded --evprstranded
positional arguments:
Input
    GFF/GTF file
Output
    Output annotation file in GFF format, with AED scores
source
    Source of the annotation (eugene, maker, braker3, helixer…)
evtr
    GFF file of transcript evidence, compressed with bgzip and indexed with tabix
evpr
    GFF file of protein evidence, compressed with bgzip and indexed with tabix
optional arguments:
-h, --help
    show this help message and exit
--evtr_source EVTR_SOURCE
    Source of the transcript evidence GFF file, e.g. "stringtie"; default=undefined
--evpr_source EVPR_SOURCE
    Source of the protein evidence GFF file, e.g. "blastx, diamond, exonerate, miniprot"; default=undefined
--evtrstranded
    Require the same strand orientation to consider a match with transcript evidence; default=False
--evprstranded
    Require the same strand orientation to consider a match with protein evidence; default=False
--penalty_overflow PENALTY_OVERFLOW
    If a coding sequence (CDS) exceeds the expected length or violates intron constraints based on transcript evidence, a penalty is applied when computing the AED score. The penalty value ranges from 0.0 to 1.0; default=0.0 (no penalty)
--longreads LONGREADS
    GFF file of long-read transcript evidence, compressed with bgzip and indexed with tabix
--longreads_source LONGREADS_SOURCE
    Source of the long-read evidence GFF file, e.g. "Iso-Seq"; default=undefined
--longreads_penalty_overflow LONGREADS_PENALTY_OVERFLOW
    If a coding sequence (CDS) exceeds the expected length or violates intron constraints based on long-read transcript evidence, a penalty is applied when computing the AED score. The penalty value ranges from 0.0 to 1.0; default=0.25
--aedtr AEDTR
    Transcript AED value used for the graph limits; default=0.5
--aedpr AEDPR
    Protein AED value used for the graph limits; default=0.2
--aed_tr_cds_only
    For transcripts (short reads and long reads), compute the AED on CDS only, instead of on both exon and CDS with selection of the best score; default=False
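To make the AED scores above more concrete, here is a minimal Python sketch of the commonly used base-level definition, AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the gene model and SP is the fraction of gene-model bases covered by the evidence. The function names (`merged`, `covered`, `overlap`, `aed`) are hypothetical and are not part of ingenannot. The sketch ignores strand and does not model the overflow penalties, because this page does not specify how they enter the computation.

```python
def merged(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    out = []
    for start, end in sorted(intervals):
        if out and start <= out[-1][1] + 1:
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((start, end))
    return out


def covered(intervals):
    """Number of distinct bases covered by a set of intervals."""
    return sum(end - start + 1 for start, end in merged(intervals))


def overlap(a, b):
    """Number of bases shared by two interval sets."""
    total = 0
    for s1, e1 in merged(a):
        for s2, e2 in merged(b):
            lo, hi = max(s1, s2), min(e1, e2)
            if lo <= hi:
                total += hi - lo + 1
    return total


def aed(prediction, evidence):
    """Base-level AED between a gene model and one piece of evidence.

    Assumption: standard definition only, no strand check, no penalty.
    """
    shared = overlap(prediction, evidence)
    if shared == 0:
        return 1.0  # no common base: maximal distance
    sensitivity = shared / covered(evidence)
    specificity = shared / covered(prediction)
    return 1.0 - (sensitivity + specificity) / 2.0


# Made-up coordinates, for illustration only.
model_exons = [(100, 199), (300, 399)]
evidence_exons = [(150, 199), (300, 449)]
print(aed(model_exons, evidence_exons))
```

A score of 0 means the model and the evidence cover exactly the same bases, and a score of 1 means they share none.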
inputs
Gene models in GFF/GTF format. Each source of evidence must be a GFF file compressed with bgzip and indexed with tabix, for example:
# example of transcripts
chr_1 StringTie exon 229833 230125 1000 + . gene_id "SRR8788921.34"; transcript_id "SRR8788921.24.1_1-1693"; exon_number "1"; cov "34.847462";
chr_1 StringTie transcript 229833 230125 1000 + . gene_id "SRR8788921.34"; transcript_id "SRR8788921.24.1_1-1693"; cov "34.847462"; FPKM "13.868441"; TPM "17.033602";
chr_1 StringTie exon 1054358 1057759 1000 + . gene_id "SCA3419A32.234"; transcript_id "SCA3419A32.219.1"; exon_number "1"; cov "89.961784";
chr_1 StringTie transcript 1054358 1057759 1000 + . gene_id "SCA3419A32.234"; transcript_id "SCA3419A32.219.1"; cov "89.961784"; FPKM "8.020829"; TPM "11.758745";
chr_1 StringTie exon 5310080 5314987 1000 - . gene_id "SRR6215483.1607"; transcript_id "SRR6215483.1600.1"; exon_number "1"; cov "13.255298";
# sort / index evidence
sort -k1,1 -k4,4g transcript.gff > transcript.sort.gff
bgzip transcript.sort.gff
tabix -p gff transcript.sort.gff.gz
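Because tabix expects records grouped by sequence and ordered by start position, it can help to check a file before compressing it. The following is a small sketch of such a check in Python. The helper `first_unsorted_line` is hypothetical and is not provided by ingenannot or tabix, and it assumes tab-separated GFF columns.

```python
def first_unsorted_line(lines):
    """Return the 1-based number of the first out-of-order feature line, or None.

    Assumes tab-separated GFF lines; blank lines and '#' comments are skipped.
    """
    seen = set()
    current = None
    last_start = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        seqid, start = fields[0], int(fields[3])
        if seqid != current:
            if seqid in seen:
                return number  # this sequence already appeared in an earlier block
            seen.add(seqid)
            current, last_start = seqid, start
        elif start < last_start:
            return number  # start coordinate goes backwards within a sequence
        else:
            last_start = start
    return None


def check_file(path):
    """Convenience wrapper reading an uncompressed GFF file from disk."""
    with open(path) as handle:
        return first_unsorted_line(handle)
```

Calling `check_file("transcript.sort.gff")` before running bgzip should return `None` for a correctly sorted file.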
outputs
The output is the GFF file annotated with AED scores for each source of evidence, reporting the best match (score and evidence name), together with a graphical representation of the AED scores for all genes (AED plot).
# input gff:
chr_1 ingenannot gene 5318821 5320515 . - . ID=gene:curtin.2067;
chr_1 ingenannot mRNA 5318821 5320515 . - . ID=mRNA:curtin.2067.1;source=CURTIN;Parent=gene:curtin.2067;
chr_1 ingenannot exon 5318821 5320515 . - . ID=exon:mRNA:curtin.2067.1.1;Parent=mRNA:curtin.2067.1;
chr_1 ingenannot CDS 5318821 5320515 . - 0 ID=cds:mRNA:curtin.2067.1;Parent=mRNA:curtin.2067.1;
# output gff:
chr_1 ingenannot gene 5318821 5320515 . - . ID=gene:curtin.2067;source=CURTIN;
chr_1 ingenannot mRNA 5318821 5320515 . - . ID=mRNA:curtin.2067.1;source=CURTIN;Parent=gene:curtin.2067;ev_tr=SRR8788924.1287.1;aed_ev_tr=0.0366;ev_tr_penalty=undef;ev_pr=match.75763;aed_ev_pr=0.0478;ev_lg=PB.1516.1;aed_ev_lg=0.0468;ev_lg_penalty=no;
chr_1 ingenannot exon 5318821 5320515 . - . ID=exon:mRNA:curtin.2067.1.1;source=CURTIN;Parent=mRNA:curtin.2067.1;
chr_1 ingenannot CDS 5318821 5320515 . - 0 ID=cds:mRNA:curtin.2067.1;source=CURTIN;Parent=mRNA:curtin.2067.1;
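The AED attributes written on the mRNA lines can be read back with a few lines of Python. The sketch below assumes tab-separated GFF columns and the attribute keys shown in the output example above (`aed_ev_tr`, `aed_ev_pr`, `aed_ev_lg`). The helper names and the filter in `well_supported` are hypothetical: the filter only reuses the default values of --aedtr and --aedpr as illustrative cut-offs and is not the tool's own rule.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value;) into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def aed_scores(gff_line):
    """Return (mRNA ID, {attribute: score or None}) for an mRNA line, else None."""
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "mRNA":
        return None
    attrs = parse_attributes(fields[8])
    scores = {}
    for key in ("aed_ev_tr", "aed_ev_pr", "aed_ev_lg"):
        try:
            scores[key] = float(attrs[key])
        except (KeyError, ValueError):
            scores[key] = None  # attribute absent or not numeric
    return attrs.get("ID"), scores


def well_supported(scores, max_tr=0.5, max_pr=0.2):
    """Hypothetical filter: transcript or protein AED at or below a cut-off."""
    tr, pr = scores["aed_ev_tr"], scores["aed_ev_pr"]
    return (tr is not None and tr <= max_tr) or (pr is not None and pr <= max_pr)


# The mRNA line from the output example, rebuilt with tab separators.
example = "\t".join([
    "chr_1", "ingenannot", "mRNA", "5318821", "5320515", ".", "-", ".",
    "ID=mRNA:curtin.2067.1;source=CURTIN;Parent=gene:curtin.2067;"
    "ev_tr=SRR8788924.1287.1;aed_ev_tr=0.0366;ev_tr_penalty=undef;"
    "ev_pr=match.75763;aed_ev_pr=0.0478;"
    "ev_lg=PB.1516.1;aed_ev_lg=0.0468;ev_lg_penalty=no;",
])
mrna_id, scores = aed_scores(example)
print(mrna_id, scores, well_supported(scores))
```

Lines for other feature types (gene, exon, CDS) return `None`, since only the mRNA lines carry the AED attributes in the example.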
Output AED plot (corresponding plot of the output gff file above):

An AED plot from a real transcript dataset:

In the white text boxes: all transcripts in the area (between the red dashed lines). In the blue text boxes: all transcripts without a penalty for violating intron constraints. In the middle: a scatter plot of the AED scores. On each axis: a histogram of the transcript and protein AED scores.