I have a file `gff3.txt` containing data like this (billions of lines):
scaffold1000|size145372 . gene 16987 23149 . - . ID=evm.TU.scaffold1000|size145372.2;Name=EVM%20prediction%20scaffold1000|size145372.2
scaffold1000|size145372 . mRNA 16987 23149 . - . ID=evm.model.scaffold1000|size145372.2;Parent=evm.TU.scaffold1000|size145372.2;Name=EVM%20prediction%20scaffold1000|size145372.2
scaffold1000|size145372 . exon 22965 23149 . - . ID=evm.model.scaffold1000|size145372.2.exon1;Parent=evm.model.scaffold1000|size145372.2
scaffold9|size467357 . gene 373475 396789 . + . ID=evm.TU.scaffold9|size467357.56;Name=EVM%20prediction%20scaffold9|size467357.56
scaffold9|size467357 . mRNA 373475 396789 . + . ID=evm.model.scaffold9|size467357.56;Parent=evm.TU.scaffold9|size467357.56;Name=EVM%20prediction%20scaffold9|size467357.56
scaffold9|size467357 . exon 373475 373695 . + . ID=evm.model.scaffold9|size467357.56.exon1;Parent=evm.model.scaffold9|size467357.56
...
And another file, `position.txt` (billions of lines):
scaffold1000|size145372.2 scaffold1000|size145372:16987-23149
scaffold9|size467357.56 scaffold10008|size45161:373475-396789
...
And I would like to obtain this:
scaffold1000|size145372 . gene 16987 23149 . - . ID=evm.TU.scaffold1000|size145372:16987-23149;Name=EVM%20prediction%20scaffold1000|size145372:16987-23149
scaffold1000|size145372 . mRNA 16987 23149 . - . ID=evm.model.scaffold1000|size145372:16987-23149;Parent=evm.TU.scaffold1000|size145372:16987-23149;Name=EVM%20prediction%20scaffold1000|size145372:16987-23149
scaffold1000|size145372 . exon 22965 23149 . - . ID=evm.model.scaffold1000|size145372:16987-23149.exon1;Parent=evm.model.scaffold1000|size145372:16987-23149
scaffold9|size467357 . gene 373475 396789 . + . ID=evm.TU.scaffold10008|size45161:373475-396789;Name=EVM%20prediction%20scaffold10008|size45161:373475-396789
scaffold9|size467357 . mRNA 373475 396789 . + . ID=evm.model.scaffold10008|size45161:373475-396789;Parent=evm.TU.scaffold10008|size45161:373475-396789;Name=EVM%20prediction%20scaffold10008|size45161:373475-396789
scaffold9|size467357 . exon 373475 373695 . + . ID=evm.model.scaffold10008|size45161:373475-396789.exon1;Parent=evm.model.scaffold10008|size45161:373475-396789
...
So I would like to find, in column 9 of `gff3.txt`, the patterns that match column 1 of `position.txt`, and replace them with the corresponding value from column 2 of `position.txt`.
I tried with awk:
awk '
NR==FNR{a[$9]
next
}
($2 in a) {
print
}' gff3.txt position.txt > output.txt
But this didn't work. Maybe it is because the patterns in column 9 of `gff3.txt` are embedded in other information?
I also tried to adapt these threads to my data, without success: stackoverflow1, stackoverflow2, stackoverflow3, stackExchange...
Any advice for coding this in `awk`, `sed`, or other tools would be much appreciated.
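
For reference, here is a minimal sketch of how the replacement described above might be written; it is not a tested solution for the real files. It rests on several assumptions: `position.txt` fits in memory as an awk array, every ID embedded in column 9 has the form `scaffold<N>|size<N>.<N>`, and column 9 is the last field of each line with no trailing whitespace. The function name `rename_ids` is hypothetical. Unlike the attempt shown earlier, it reads `position.txt` first and searches inside column 9 instead of comparing the whole column against a key.

```shell
rename_ids() {
    # $1 = mapping file (column 1: old ID, column 2: new ID), $2 = GFF3 file
    awk '
    # First file: remember old ID -> new ID (assumes it fits in memory).
    NR == FNR { map[$1] = $2; next }

    # Second file: replace every known ID found inside column 9.
    NF == 9 {
        rest = $9
        out = ""
        # Assumed ID shape: scaffold<digits>|size<digits>.<digits>
        while (match(rest, /scaffold[0-9]+[|]size[0-9]+\.[0-9]+/)) {
            old = substr(rest, RSTART, RLENGTH)
            out = out substr(rest, 1, RSTART - 1) ((old in map) ? map[old] : old)
            rest = substr(rest, RSTART + RLENGTH)
        }
        # Keep columns 1-8 and their original separators untouched
        # (assumes column 9 ends the line, with no trailing whitespace).
        print substr($0, 1, length($0) - length($9)) out rest
        next
    }

    # Any other line (comments, unexpected column counts) passes through.
    { print }
    ' "$1" "$2"
}
```

It would be called as `rename_ids position.txt gff3.txt > output.txt`. IDs that are missing from `position.txt` are left unchanged, and suffixes such as `.exon1` are preserved because only the matched ID itself is replaced.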