I have a GFF file whose contents look like the following (tab-separated):
# start gene 1Chr.g1
1Chr AUGUSTUS gene 3636 5916 0.1 + . ID=1Chr.g1
1Chr AUGUSTUS transcript 3636 5916 0.1 + . ID=1Chr.g1.t1;Parent=1Chr.g1
1Chr AUGUSTUS transcription_start_site 3636 3636 . + . Parent=1Chr.g1.t1
1Chr AUGUSTUS exon 3636 3913 . + . Parent=1Chr.g1.t1
1Chr AUGUSTUS start_codon 3760 3762 . + 0 Parent=1Chr.g1.t1
1Chr AUGUSTUS intron 3914 3995 1 + .
1Chr AUGUSTUS CDS 3760 3913 1 + 0 ID=1Chr.g1.t1.cds;Parent=1Chr.g1.t1
1Chr AUGUSTUS stop_codon 5628 5630 . + 0 Parent=1Chr.g1.t1
1Chr AUGUSTUS transcription_end_site 5916 5916 . + . Parent=1Chr.g1.t1
# start gene 1Chr.g2
1Chr AUGUSTUS gene 5938 8761 0.17 - . ID=1Chr.g2
1Chr AUGUSTUS transcript 5938 8761 0.17 - . ID=1Chr.g2.t1;Parent=1Chr.g2
1Chr AUGUSTUS transcription_end_site 5938 5938 . - . Parent=1Chr.g2.t1
1Chr AUGUSTUS exon 5938 6594 . - . Parent=1Chr.g2.t1
1Chr AUGUSTUS stop_codon 6428 6430 . - 0 Parent=1Chr.g2.t1
1Chr AUGUSTUS intron 6595 7156 0.8 - . Parent=1Chr.g2.t1
1Chr AUGUSTUS CDS 6428 6594 0.89 - 2 ID=1Chr.g2.t1.cds;Parent=1Chr.g2.t1
# start gene 2Chr.g1
2Chr AUGUSTUS gene 11612 13481 0.09 - . ID=2Chr.g1
2Chr AUGUSTUS transcript 11612 13481 0.09 - . ID=2Chr.g1.t1;Parent=2Chr.g1
2Chr AUGUSTUS transcription_end_site 11612 11612 . - . Parent=2Chr.g1.t1
2Chr AUGUSTUS exon 11612 13481 . - . Parent=2Chr.g1.t1
2Chr AUGUSTUS stop_codon 11864 11866 . - 0 Parent=2Chr.g1.t1
2Chr AUGUSTUS CDS 11864 12940 1 - 0 ID=2Chr.g1.t1.cds;Parent=2Chr.g1.t1
2Chr AUGUSTUS start_codon 12938 12940 . - 0 Parent=2Chr.g1.t1
2Chr AUGUSTUS transcription_start_site 13481 13481 . - . Parent=2Chr.g1.t1
# start gene 2Chr.g2
2Chr AUGUSTUS gene 22876 31223 0.04 + . ID=2Chr.g2
2Chr AUGUSTUS transcript 22876 31223 0.04 + . ID=2Chr.g2.t1;Parent=2Chr.g2
2Chr AUGUSTUS transcription_start_site 22876 22876 . + . Parent=2Chr.g2.t1
2Chr AUGUSTUS exon 22876 23456 . + . Parent=2Chr.g2.t1
2Chr AUGUSTUS exon 23515 24451 . + . Parent=2Chr.g2.t1
2Chr AUGUSTUS start_codon 23519 23521 . + 0 Parent=2Chr.g2.t1
I want to replace the gene IDs (1Chr.g1, 1Chr.g2, 2Chr.g1, and 2Chr.g2) with sequential IDs, starting at g1 and continuing to the last gene, which in this case is g4.
Expected Output
# start gene g1
1Chr AUGUSTUS gene 3636 5916 0.1 + . ID=g1
1Chr AUGUSTUS transcript 3636 5916 0.1 + . ID=g1.t1;Parent=g1
1Chr AUGUSTUS transcription_start_site 3636 3636 . + . Parent=g1.t1
1Chr AUGUSTUS exon 3636 3913 . + . Parent=g1.t1
1Chr AUGUSTUS start_codon 3760 3762 . + 0 Parent=g1.t1
1Chr AUGUSTUS intron 3914 3995 1 + .
1Chr AUGUSTUS CDS 3760 3913 1 + 0 ID=g1.t1.cds;Parent=g1.t1
1Chr AUGUSTUS stop_codon 5628 5630 . + 0 Parent=g1.t1
1Chr AUGUSTUS transcription_end_site 5916 5916 . + . Parent=g1.t1
# start gene g2
1Chr AUGUSTUS gene 5938 8761 0.17 - . ID=g2
1Chr AUGUSTUS transcript 5938 8761 0.17 - . ID=g2.t1;Parent=g2
1Chr AUGUSTUS transcription_end_site 5938 5938 . - . Parent=g2.t1
1Chr AUGUSTUS exon 5938 6594 . - . Parent=g2.t1
1Chr AUGUSTUS stop_codon 6428 6430 . - 0 Parent=g2.t1
1Chr AUGUSTUS intron 6595 7156 0.8 - . Parent=g2.t1
1Chr AUGUSTUS CDS 6428 6594 0.89 - 2 ID=g2.t1.cds;Parent=g2.t1
# start gene g3
2Chr AUGUSTUS gene 11612 13481 0.09 - . ID=g3
2Chr AUGUSTUS transcript 11612 13481 0.09 - . ID=g3.t1;Parent=g3
2Chr AUGUSTUS transcription_end_site 11612 11612 . - . Parent=g3.t1
2Chr AUGUSTUS exon 11612 13481 . - . Parent=g3.t1
2Chr AUGUSTUS stop_codon 11864 11866 . - 0 Parent=g3.t1
2Chr AUGUSTUS CDS 11864 12940 1 - 0 ID=g3.t1.cds;Parent=g3.t1
2Chr AUGUSTUS start_codon 12938 12940 . - 0 Parent=g3.t1
2Chr AUGUSTUS transcription_start_site 13481 13481 . - . Parent=g3.t1
# start gene g4
2Chr AUGUSTUS gene 22876 31223 0.04 + . ID=g4
2Chr AUGUSTUS transcript 22876 31223 0.04 + . ID=g4.t1;Parent=g4
2Chr AUGUSTUS transcription_start_site 22876 22876 . + . Parent=g4.t1
2Chr AUGUSTUS exon 22876 23456 . + . Parent=g4.t1
2Chr AUGUSTUS exon 23515 24451 . + . Parent=g4.t1
2Chr AUGUSTUS start_codon 23519 23521 . + 0 Parent=g4.t1
I wrote the following bash script, but it is far too slow. I timed it: a single sed call takes about 1 second, so with 28000 iterations it would take about 8 hours.
Is there a more efficient way to do this?
awk '$3 == "gene"' $1 | cut -f9 | grep -o "=.*" | sed -e 's/=//g' > LIST.txt
COUNTER=0
cat LIST.txt | while read line; do
    COUNTER=$(expr $COUNTER + 1)
    echo "sed -i 's/$line/g$COUNTER/g' $1" | bash
done
rm LIST.txt
Another problem: the script generates a file named sedTG45, which is very annoying.
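For reference, the renumbering described above can probably be done in a single pass instead of one sed call per gene. The following is only a minimal sketch, not a tested solution for the full 28000-gene file. It assumes the file really is tab-separated, that every feature line of a gene follows that gene's "# start gene" comment or gene line (as in the sample), and that IDs sit in column 9. The function name renumber_genes is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: renumber gene IDs in one pass over the GFF file.
# Usage: renumber_genes input.gff > output.gff
renumber_genes() {
    awk '
    BEGIN { FS = OFS = "\t" }

    # Replace "=old" with "=new" literally (no regex, so dots are safe),
    # but only where old is followed by ".", ";" or the end of the string.
    function swap(s, old, new,    out, p, tail, c) {
        out = ""
        while ((p = index(s, "=" old)) > 0) {
            tail = substr(s, p + length(old) + 1)
            c = substr(tail, 1, 1)
            if (c == "" || c == ";" || c == ".")
                out = out substr(s, 1, p) new
            else
                out = out substr(s, 1, p + length(old))
            s = tail
        }
        return out s
    }

    # "# start gene X" comment: open a new gene and rewrite the comment.
    /^# start gene / {
        sub(/[ \t]+$/, "")
        k = split($0, w, /[ \t]+/)
        old = w[k]; new = "g" (++n)
        print "# start gene " new
        next
    }

    # Gene line with no preceding comment: open a new gene here instead.
    $3 == "gene" && match($9, /ID=[^;]+/) {
        id = substr($9, RSTART + 3, RLENGTH - 3)
        if (id != old) { old = id; new = "g" (++n) }
    }

    # Rewrite the attribute column only when it exists, so 8-column
    # lines (such as the intron line in gene 1) are printed unchanged.
    NF >= 9 && old != "" { $9 = swap($9, old, new) }

    { print }
    ' "$1"
}
```

Because this reads the file once and writes the result to standard output rather than editing the file in place, the run time should grow with the file size rather than with the number of genes times the file size, and it should not leave temporary files behind. It also replaces the old ID as a literal string, so 1Chr.g1 cannot accidentally match inside 1Chr.g10.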