Add UTRs to gene models
Here we propose a protocol to add UTRs to your gene models based on long-read and RNA-Seq data. We will prepare the data and use both data types to add UTRs. We assume that long-read data provide a more reliable definition of UTRs than transcript assemblies obtained from RNA-Seq data. We therefore first use the long-read data, with isoforms ranked by their RNA-Seq support, and then add potential new UTRs from RNA-Seq transcripts where UTRs are not yet annotated. If you have only one type of data, you can limit the protocol to the available data.
Workflow:
1) Add UTRs from long-read data if available
# rank your long reads based on junction support and coverage
ingenannot -v 2 isoform_ranking longreads.gff -f file.fof --alt_threshold 0.1 --rescue
# add UTRs from long reads, using the rank to select the preferred isoform
ingenannot -v 2 utr_refine genes.gff3 isoforms.alternatives.gff genes.utrs.gff3 --erase --utr_mode rank
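To check what step 1 changed, you can count the UTR features before and after. The sketch below is not part of ingenannot: count_utrs is a hypothetical helper name, and it assumes that UTRs are written with the standard GFF3 feature types five_prime_UTR and three_prime_UTR in column 3.

```shell
# Hypothetical helper (not an ingenannot command): count UTR features in a GFF3 file.
# Assumption: UTRs use the feature types five_prime_UTR / three_prime_UTR (column 3).
count_utrs() {
    awk -F'\t' '
        !/^#/ && $3 == "five_prime_UTR"  { five++ }
        !/^#/ && $3 == "three_prime_UTR" { three++ }
        END { printf "five_prime_UTR\t%d\nthree_prime_UTR\t%d\n", five, three }
    ' "$1"
}

# Example usage: compare the annotation before and after step 1
# count_utrs genes.gff3
# count_utrs genes.utrs.gff3
```

If the counts do not change between the two files, check first that the feature types in your files match the ones assumed here.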
2) Add UTRs from short-read assemblies
# add UTRs with short reads in onlynew mode
# If you want to combine transcript assemblies from several runs,
# you have to merge the transcripts coming from the same gene.
# To merge your transcript files, you can use StringTie in merge mode.
# If you want to be sure to avoid transcripts that overlap multiple CDS,
# proceed as described below:
cat assembly.1.gff assembly.2.gff assembly.3.gff assembly.4.gff ... > all_transcripts.gff
# cluster your transcripts, removing those that overlap multiple CDS
ingenannot clusterize all_transcripts.gff transcripts.gff -f genes.gff3 --keep_atts
# rank your short-read transcripts based on junction support and coverage
ingenannot -v 2 -p 7 isoform_ranking transcripts.gff -f fof.cfg
# add UTRs from short-read transcripts in onlynew mode
ingenannot -v 2 utr_refine genes.utrs.gff3 isoforms.alternatives.gff genes.utrs.all.gff3 --onlynew --utr_mode rank
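Since the onlynew mode is meant to add UTRs only where none are annotated yet, it can be useful to list the transcripts that gained UTRs in step 2. The sketch below is not part of ingenannot: utr_parents and new_utr_transcripts are hypothetical helper names, and it assumes that each UTR feature carries a single Parent= attribute pointing to its transcript, as is usual in GFF3.

```shell
# Hypothetical helper: print the sorted, unique Parent IDs of all UTR features.
# Assumption: column 9 holds ';'-separated attributes with one Parent=<id> entry.
utr_parents() {
    awk -F'\t' '!/^#/ && ($3 == "five_prime_UTR" || $3 == "three_prime_UTR") {
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++)
            if (attrs[i] ~ /^Parent=/) print substr(attrs[i], 8)
    }' "$1" | sort -u
}

# Hypothetical helper: print transcripts that have UTRs in the second file
# but not in the first one.
new_utr_transcripts() {
    before=$(mktemp) || return 1
    after=$(mktemp) || return 1
    utr_parents "$1" > "$before"
    utr_parents "$2" > "$after"
    comm -13 "$before" "$after"   # keep lines found only in "$after"
    rm -f "$before" "$after"
}

# Example usage: transcripts whose UTRs come from the short-read step
# new_utr_transcripts genes.utrs.gff3 genes.utrs.all.gff3
```

Transcripts listed here received their UTRs from the short-read assemblies, which this protocol considers less reliable than the long reads.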