Add UTRs to gene models

Here we propose a protocol to add UTRs to your gene models based on long-read and RNA-Seq data. We prepare the data, then use both data types to add UTRs. We assume that long-read data define UTRs more reliably than transcript assemblies obtained from RNA-Seq data. We therefore first use the long-read isoforms, ranked by their support from RNA-Seq data, and then add potential new UTRs from the RNA-Seq transcripts to gene models that are still without UTRs. If you have only one type of data, you can limit the protocol to the steps that use it.

Workflow:

digraph UTRs {
   "Rank long-read transcript isoforms" -> "Add / Refine UTRs";
   "Rank short-read transcript isoforms" -> "Add / Refine UTRs";
   "Clusterize transcripts" -> "Rank short-read transcript isoforms";
}
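
As a rough illustration of how the two branches of the workflow chain together, the sketch below prints which steps apply for the data you have. `plan_utr_steps` is a hypothetical helper, not part of ingenannot, and it only prints a plan. It assumes that when no long-read file is available, the short-read step starts directly from `genes.gff3` instead of `genes.utrs.gff3`.

```shell
# Hypothetical helper: print the steps that apply, given the available data.
# It does not run ingenannot.
plan_utr_steps() {
    local longreads="$1"
    local shortreads="$2"
    local input="genes.gff3"
    # Step 1 applies only if a non-empty long-read file is given.
    if [ -s "$longreads" ]; then
        echo "1) long reads: isoform_ranking, then utr_refine on $input"
        # Assumption: the output of step 1 becomes the input of step 2.
        input="genes.utrs.gff3"
    fi
    # Step 2 applies only if a non-empty short-read transcript file is given.
    if [ -s "$shortreads" ]; then
        echo "2) short reads: clusterize, isoform_ranking, then utr_refine --onlynew on $input"
    fi
}

# Example call:
# plan_utr_steps longreads.gff all_transcripts.gff
```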

1) Add UTRs from long-read data if available

# rank your long-reads based on junction support and coverage
ingenannot -v 2 isoform_ranking longreads.gff -f file.fof --alt_threshold 0.1 --rescue

# add UTRs from long reads, using the rank to select the preferred isoform
ingenannot -v 2 utr_refine genes.gff3 isoforms.alternatives.gff genes.utrs.gff3 --erase --utr_mode rank

2) Add UTRs from short-read assemblies

# add UTRs with short reads in onlynew mode
# If you want to combine transcript assemblies from several runs,
# you have to merge the transcripts coming from the same gene.
# To merge your transcript files, you can use StringTie in merge mode.
# If you want to be sure to avoid transcripts that overlap multiple CDS,
# proceed as described below:

cat assembly.1.gff assembly.2.gff assembly.3.gff assembly.4.gff ... > all_transcripts.gff
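
A plain `cat` silently skips nothing and fails late if a file is missing. The sketch below is one possible guard around the concatenation: it checks that every input exists and is non-empty, and drops comment and blank lines so that repeated headers do not end up in the middle of `all_transcripts.gff`. `merge_assemblies` is a hypothetical helper name, not an ingenannot command.

```shell
# Hypothetical helper: concatenate transcript assemblies into one file.
# Usage: merge_assemblies OUTPUT INPUT [INPUT ...]
merge_assemblies() {
    local out="$1"
    shift
    local f
    # Stop early if an input is missing or empty.
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            return 1
        fi
    done
    # Keep feature lines only: skip comment/header lines and blank lines.
    awk '!/^#/ && NF' "$@" > "$out"
}

# Example call with the file names used above:
# merge_assemblies all_transcripts.gff assembly.1.gff assembly.2.gff assembly.3.gff
```

Note that this does not rename transcripts: if several assemblies reuse the same transcript identifiers, you may still need to make them unique before clustering.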

# clusterize your transcripts, removing those that overlap multiple CDS
ingenannot clusterize all_transcripts.gff transcripts.gff -f genes.gff3 --keep_atts

# rank your short-reads based on junction support and coverage
ingenannot -v 2 -p 7 isoform_ranking transcripts.gff -f fof.cfg

# add UTR with short reads in onlynew mode
ingenannot -v 2 utr_refine genes.utrs.gff3 isoforms.alternatives.gff genes.utrs.all.gff3 --onlynew --utr_mode rank
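
To check what each step contributed, you can count the UTR features in the successive files. The sketch below is a minimal example; `count_utrs` is a hypothetical helper, and it assumes that UTRs are written with the standard GFF3 feature types `five_prime_UTR` and `three_prime_UTR` in column 3. Adjust the names if your output uses other types.

```shell
# Hypothetical helper: count UTR features in a GFF3 file.
# Assumes the feature types five_prime_UTR and three_prime_UTR (column 3).
count_utrs() {
    awk -F '\t' '
        !/^#/ && $3 == "five_prime_UTR"  { five++ }
        !/^#/ && $3 == "three_prime_UTR" { three++ }
        END { printf "five_prime_UTR\t%d\nthree_prime_UTR\t%d\n", five, three }
    ' "$1"
}

# Example: compare the files produced along the protocol.
# for f in genes.gff3 genes.utrs.gff3 genes.utrs.all.gff3; do
#     echo "== $f"; count_utrs "$f"
# done
```

With `--onlynew`, one would expect the counts in `genes.utrs.all.gff3` to be at least those in `genes.utrs.gff3`, since that step is meant to add UTRs rather than replace them.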