
I have a file `gff3.txt` with this kind of data (billions of lines):

 scaffold1000|size145372 . gene 16987 23149 . - . ID=evm.TU.scaffold1000|size145372.2;Name=EVM%20prediction%20scaffold1000|size145372.2
 scaffold1000|size145372 . mRNA 16987 23149 . - . ID=evm.model.scaffold1000|size145372.2;Parent=evm.TU.scaffold1000|size145372.2;Name=EVM%20prediction%20scaffold1000|size145372.2
 scaffold1000|size145372 . exon 22965 23149 . - . ID=evm.model.scaffold1000|size145372.2.exon1;Parent=evm.model.scaffold1000|size145372.2
 scaffold9|size467357 . gene 373475 396789 . + . ID=evm.TU.scaffold9|size467357.56;Name=EVM%20prediction%20scaffold9|size467357.56
 scaffold9|size467357 . mRNA 373475 396789 . + . ID=evm.model.scaffold9|size467357.56;Parent=evm.TU.scaffold9|size467357.56;Name=EVM%20prediction%20scaffold9|size467357.56
 scaffold9|size467357 . exon 373475 373695 . + . ID=evm.model.scaffold9|size467357.56.exon1;Parent=evm.model.scaffold9|size467357.56
 ...

And another file, `position.txt` (billions of lines):

 scaffold1000|size145372.2  scaffold1000|size145372:16987-23149
 scaffold9|size467357.56    scaffold10008|size45161:373475-396789
 ...

I would like to obtain this:

 scaffold1000|size145372 . gene 16987 23149 . - . ID=evm.TU.scaffold1000|size145372:16987-23149;Name=EVM%20prediction%20scaffold1000|size145372:16987-23149
 scaffold1000|size145372 . mRNA 16987 23149 . - . ID=evm.model.scaffold1000|size145372:16987-23149;Parent=evm.TU.scaffold1000|size145372:16987-23149;Name=EVM%20prediction%20scaffold1000|size145372:16987-23149
 scaffold1000|size145372 . exon 22965 23149 . - . ID=evm.model.scaffold1000|size145372:16987-23149.exon1;Parent=evm.model.scaffold1000|size145372:16987-23149
 scaffold9|size467357 . gene 373475 396789 . + . ID=evm.TU.scaffold10008|size45161:373475-396789;Name=EVM%20prediction%20scaffold10008|size45161:373475-396789
 scaffold9|size467357 . mRNA 373475 396789 . + . ID=evm.model.scaffold10008|size45161:373475-396789;Parent=evm.TU.scaffold10008|size45161:373475-396789;Name=EVM%20prediction%20scaffold10008|size45161:373475-396789
 scaffold9|size467357 . exon 373475 373695 . + . ID=evm.model.scaffold10008|size45161:373475-396789.exon1;Parent=evm.model.scaffold10008|size45161:373475-396789
 ...

So I would like to find, in column 9 of `gff3.txt`, the patterns that match column 1 of `position.txt`, and replace them with the corresponding value from column 2 of `position.txt`.

I tried with awk:

 awk '
     NR==FNR{a[$9]
     next
 }
 ($2 in a) {
     print
 }' gff3.txt position.txt > output.txt

But this didn't work. Maybe it is because the patterns in column 9 of `gff3.txt` are embedded in other information?
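A minimal sketch of why the lookup above finds nothing, using one sample line from each file. The array keys hold the whole of column 9, and `in` tests for an exact key, while column 1 of `position.txt` is only a fragment of that string (the attempt also tests `$2`, the new name, rather than `$1`). The variable names are illustrative only.

```shell
# One sample line from each file, copied from the examples above.
gff_line='scaffold9|size467357 . gene 373475 396789 . + . ID=evm.TU.scaffold9|size467357.56;Name=EVM%20prediction%20scaffold9|size467357.56'
pos_line='scaffold9|size467357.56    scaffold10008|size45161:373475-396789'

# Column 9 of the GFF3 line is the whole attribute string ...
col9=$(printf '%s\n' "$gff_line" | awk '{ print $9 }')
# ... while column 1 of position.txt is only a fragment of it.
key=$(printf '%s\n' "$pos_line" | awk '{ print $1 }')

# An exact comparison (what an array-key lookup amounts to) fails,
# whereas a substring search with index() succeeds.
exact=$(awk -v a="$col9" -v b="$key" 'BEGIN { print ((a == b) ? "equal" : "different") }')
inside=$(awk -v a="$col9" -v b="$key" 'BEGIN { print ((index(a, b) > 0) ? "contained" : "absent") }')

echo "$exact $inside"
```

So the key has to be searched for inside column 9, or extracted from it, before any lookup can succeed.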

I also tried to adapt these threads to my data, but without success: stackoverflow1, stackoverflow2, stackoverflow3, stackExchange...

Any advice for coding this in awk, sed or another tool would be much appreciated.

Claudio
  • what's the size of `position.txt` file (number of entries) ? – RomanPerekhrest May 16 '19 at 09:32
  • @RomanPerekhrest Both files have billions of lines (I added this info in an edit) – Pierre-louis Stenger May 16 '19 at 09:35
  • Don't write `awk` with backticks around it, just `awk '`. Can you say exactly what the value of `$9` is, from these columns of `gff3.txt`? You want to `join` the files, but I can't see which field you join them on. Do you just want to substitute `scaffold1000|size145372.2` with `scaffold1000|size145372:16987-23149`? Are the files sorted on a specific column? Can they be joined on the first column before the dot? – KamilCuk May 16 '19 at 09:36
  • @ Kamil Cuk Thanks, it was just an error when I wrote my question; I used plain awk when I tried it. I have edited my post, thanks again. – Pierre-louis Stenger May 16 '19 at 09:39
  • Will just a simple `sed 's/scaffold9|size467357.56/scaffold10008|size45161:373475-396789/g'` do? – KamilCuk May 16 '19 at 09:42
  • @KamilCuk Both files are sorted. I have billions of scaffoldXXX|sizeXXXX.XX that I want to change into scaffoldXXX|sizeXXXX:XXX-XXX – Pierre-louis Stenger May 16 '19 at 09:44
  • Sorted using which field? Sorted alphabetically? If they are properly sorted ( I see that `join -11 -21 <( – KamilCuk May 16 '19 at 09:52

2 Answers


I found this solution and I rather like it, but be sure to copy your file first: it edits the file in place, and you may lose information if it does not work as expected.

GNU sed:

sed 's, +,$ ,g' position.txt | xargs -I {} sed -i 's {} g' gff3.txt

macOS:

sed -E 's, +,$ ,' position.txt | xargs -I {} sed -i '' 's {} g' gff3.txt

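`xargs -I {}` runs sed on gff3.txt once for each line of position.txt.

`{}` is replaced by the current line of position.txt (not of gff3.txt). The space acts as the delimiter of the `s` command, so the first column is used as the pattern and the second as the new value.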
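The garbled ID reported in the comments under this answer (`scaffold1000|size145372:94701-95924987-23149`) can be reproduced with two plain substitutions run one after the other, as sketched below. The two commands are reconstructed from the IDs quoted in that comment and are an assumption about what was generated, not the exact commands that ran; `@` is used as the delimiter only for readability. The point is that an unescaped `.` matches any character, including the `:` written by an earlier substitution.

```shell
# One attribute value before any replacement (ID taken from the question's sample).
line='ID=evm.TU.scaffold1000|size145372.2'

# Substitution generated from the ".2" entry of position.txt.
step1=$(printf '%s\n' "$line" |
    sed 's@scaffold1000|size145372.2@scaffold1000|size145372:16987-23149@g')

# Substitution generated from the ".16" entry: the unescaped "." also matches ":",
# so the pattern hits the ":16" that step 1 has just written.
step2=$(printf '%s\n' "$step1" |
    sed 's@scaffold1000|size145372.16@scaffold1000|size145372:94701-95924@g')

# Same second substitution with the dot escaped: no match, step 1 is preserved.
escaped=$(printf '%s\n' "$step1" |
    sed 's@scaffold1000|size145372\.16@scaffold1000|size145372:94701-95924@g')

printf '%s\n%s\n' "$step2" "$escaped"
```

With the dot escaped, the second command leaves the already-replaced ID alone. A pattern such as `\.2` can still match the start of `.20`, so anchoring the end of the number remains a separate concern.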

Corentin Limier
  • And how would this make the proper replacements? – RomanPerekhrest May 16 '19 at 09:37
  • @RomanPerekhrest Have you tried this solution ? I got the exact expected result. – Corentin Limier May 16 '19 at 09:45
  • @RomanPerekhrest have you tried the GNU sed solution ? Looks like your sed does not like the '' after -i option – Corentin Limier May 16 '19 at 09:46
  • I think that I fixed the GNU solution. The problem is that position.txt has multiple spaces and GNU sed doesn't like this. Thanks for the help. – Corentin Limier May 16 '19 at 09:57
  • Please make the proper corrections to the macOS solution if it is meant to be an alternative – RomanPerekhrest May 16 '19 at 09:57
  • @RomanPerekhrest It works on my macOS, so I don't really understand what the problem is here. I don't know the differences between all the versions of bash/sed well; if you have any suggestion, please let me know. – Corentin Limier May 16 '19 at 09:59
  • @CorentinLimier Strangely, it produces weird patterns like this: `scaffold1000|size145372:94701-95924987-23149`. Before, it was `scaffold1000|size145372.2` in `gff3.txt`; in `position.txt`, next to `scaffold1000|size145372.2`, there is `scaffold1000|size145372:16987-23149`, and a few lines below, next to `scaffold1000|size145372.16`, there is `scaffold1000|size145372:94701-95924`. So it concatenates the XXXX:XXXX pattern for all scaffolds that have the same first number after the dot, like scaffoldXX|sizeXX.1, scaffoldXX|sizeXX.12, scaffoldXX|sizeXX.13... (I used the macOS code) – Pierre-louis Stenger May 16 '19 at 11:41
  • Could you try this ? `sed -E 's, +,$ ,' position.txt | xargs -I {} sed -i '' 's {} g' gff3.txt` – Corentin Limier May 16 '19 at 12:37

I came up with this:

sed "$(<position.txt sed 's/\./\\./g' | xargs -n2 printf "s@%s@%s@g\n")" gff3.txt

First I take position.txt and replace every `.` with `\.`, so that it is escaped for sed. Then, from each line, I generate a sed substitution command `s@<first column>@<second column>@g` using xargs and a simple printf. The output is fed to sed as its script, and sed applies the transformations to gff3.txt. As long as there are no "strange" inputs (no embedded spaces or newlines, all strings unique, etc.), I think this should work.
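One detail worth checking in the generated script: in its default mode, xargs treats a backslash as a quoting character and removes it, so the `\.` produced by the first sed does not reach printf. The sketch below prints what the pipeline generates for one sample line. It also shows a hypothetical variant, not part of the original answer, in which sed builds the `s@...@...@g` command itself so that the escape survives. The variant assumes whitespace-separated columns without embedded spaces, and it escapes dots in both columns, which is harmless in a sed replacement.

```shell
# One sample line of position.txt, from the question.
pos='scaffold9|size467357.56    scaffold10008|size45161:373475-396789'

# Pipeline from the answer: xargs strips the backslash in front of the dot.
via_xargs=$(printf '%s\n' "$pos" | sed 's/\./\\./g' | xargs -n2 printf 's@%s@%s@g\n')

# Hypothetical variant without xargs: escape the dots, then wrap the two
# whitespace-separated columns into an s@old@new@g command.
via_sed=$(printf '%s\n' "$pos" |
    sed -E 's/\./\\./g; s/^([^[:space:]]+)[[:space:]]+([^[:space:]]+)$/s@\1@\2@g/')

printf '%s\n%s\n' "$via_xargs" "$via_sed"
```

The first line printed contains a bare `.` in the pattern and the second contains `\.`. Either script still passes the test below, because an unescaped `.` also matches a literal dot.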

Test script:

#!/bin/bash

cat <<EOF >gff3.txt 
scaffold1000|size145372 . gene 16987 23149 . - . ID=evm.TU.scaffold1000|size145372.2;Name=EVM%20prediction%20scaffold1000|size145372.2
scaffold1000|size145372 . mRNA 16987 23149 . - . ID=evm.model.scaffold1000|size145372.2;Parent=evm.TU.scaffold1000|size145372.2;Name=EVM%20prediction%20scaffold1000|size145372.2
scaffold1000|size145372 . exon 22965 23149 . - . ID=evm.model.scaffold1000|size145372.2.exon1;Parent=evm.model.scaffold1000|size145372.2
scaffold9|size467357 . gene 373475 396789 . + . ID=evm.TU.scaffold9|size467357.56;Name=EVM%20prediction%20scaffold9|size467357.56
scaffold9|size467357 . mRNA 373475 396789 . + . ID=evm.model.scaffold9|size467357.56;Parent=evm.TU.scaffold9|size467357.56;Name=EVM%20prediction%20scaffold9|size467357.56
scaffold9|size467357 . exon 373475 373695 . + . ID=evm.model.scaffold9|size467357.56.exon1;Parent=evm.model.scaffold9|size467357.56
EOF

cat <<EOF >position.txt
scaffold1000|size145372.2  scaffold1000|size145372:16987-23149
scaffold9|size467357.56    scaffold10008|size45161:373475-396789
EOF

cat <<EOF >exp.txt
scaffold1000|size145372 . gene 16987 23149 . - . ID=evm.TU.scaffold1000|size145372:16987-23149;Name=EVM%20prediction%20scaffold1000|size145372:16987-23149
scaffold1000|size145372 . mRNA 16987 23149 . - . ID=evm.model.scaffold1000|size145372:16987-23149;Parent=evm.TU.scaffold1000|size145372:16987-23149;Name=EVM%20prediction%20scaffold1000|size145372:16987-23149
scaffold1000|size145372 . exon 22965 23149 . - . ID=evm.model.scaffold1000|size145372:16987-23149.exon1;Parent=evm.model.scaffold1000|size145372:16987-23149
scaffold9|size467357 . gene 373475 396789 . + . ID=evm.TU.scaffold10008|size45161:373475-396789;Name=EVM%20prediction%20scaffold10008|size45161:373475-396789
scaffold9|size467357 . mRNA 373475 396789 . + . ID=evm.model.scaffold10008|size45161:373475-396789;Parent=evm.TU.scaffold10008|size45161:373475-396789;Name=EVM%20prediction%20scaffold10008|size45161:373475-396789
scaffold9|size467357 . exon 373475 373695 . + . ID=evm.model.scaffold10008|size45161:373475-396789.exon1;Parent=evm.model.scaffold10008|size45161:373475-396789
EOF

sed "$(<position.txt sed 's/\./\\./g' | xargs -n2 printf "s@%s@%s@g\n")" gff3.txt > output.txt

diff exp.txt output.txt

diff prints nothing, so it works for the given example input and expected output.
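For comparison, here is a single-pass awk sketch of the same replacement. It is not a tested solution for the real data and rests on several assumptions: every ID to replace has the form `scaffold<digits>|size<digits>.<digits>` (inferred from the samples), position.txt fits in memory as an awk array (which may not hold for billions of lines), and IDs missing from position.txt are left unchanged. The second GFF3 line, ending in `.20`, is hypothetical and only there to show that a longer number is not mistaken for `.2`.

```shell
tmp=$(mktemp -d)

cat > "$tmp/position.txt" <<'EOF'
scaffold1000|size145372.2  scaffold1000|size145372:16987-23149
EOF

cat > "$tmp/gff3.txt" <<'EOF'
scaffold1000|size145372 . exon 22965 23149 . - . ID=evm.model.scaffold1000|size145372.2.exon1;Parent=evm.model.scaffold1000|size145372.2
scaffold1000|size145372 . gene 1 2 . - . ID=evm.TU.scaffold1000|size145372.20
EOF

result=$(awk '
    # First file: remember old ID -> new ID.
    NR == FNR { map[$1] = $2; next }
    # Second file: walk through each line, one ID-shaped token at a time.
    {
        rest = $0; out = ""
        while (match(rest, /scaffold[0-9]+\|size[0-9]+\.[0-9]+/)) {
            id = substr(rest, RSTART, RLENGTH)
            # Copy the text before the token, then the mapped ID (or the token itself).
            out = out substr(rest, 1, RSTART - 1) ((id in map) ? map[id] : id)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }
' "$tmp/position.txt" "$tmp/gff3.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because each token is extracted whole and looked up as an exact key, no regex escaping is needed and the replaced text is never rescanned.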

KamilCuk