1

I have 2 different data frame, one is of 5.5 MB and the other is 25 GB. I want to check if these two data frame have the same value in 2 different columns for each row.

For e.g.

x 0 0 a
x 1 2  b
y 1 2 c
z 3 4  d

and

x 0 0 w
x 1 2  m
y 5 6 p
z 8 9  q

I want to check if the 2° and 3° column are equal for each row, if yes I return the 4° columns for the both data frame.Then I should have:

a w
b m
c m

the 2 data frame are sorted respect the 2° and 3° column value. I try in R but the 2° file (25 GB) is too big. How can I obtain this new file in a "faster" (even some hours) way ???

Will
  • 1,619
  • 5
  • 23

2 Answers2

2

With GNU awk for arrays of arrays:

$ cat tst.awk
NR==FNR { a[$2,$3][$4]; next }
($2,$3) in a {
    for (val in a[$2,$3]) {
        print val, $4
    }
}

$ awk -f tst.awk small_file large_file
a w
b m
c m

and with any awk (a bit less efficiently):

$ cat tst.awk
NR==FNR { a[$2,$3] = a[$2,$3] FS $4; next }
($2,$3) in a {
    split(a[$2,$3],vals)
    for (i in vals) {
        print vals[i], $4
    }
}

$ awk -f tst.awk small_file large_file
a w
b m
c m

The above when reading small_file (NR==FNR is only true for the first file read - look up those variables in the awk man page or google) creates an associative array a[] that maps an index created from the concatenation of the 2nd+3rd fields to the list of value of the 4th field for those 2nd/3rd field combinations. Then when reading large_file it looks up that array for the current 2nd/3rd field combination and loops through all of the values stored for that combination in the previous phase printing that value (the $4 from small_file) plus the current $4.

You said your small file is 5.5 MB and the large file is 25 GB. Since 1 MB is about 1,047,600 characters (see https://www.computerhope.com/issues/chspace.htm) and each of your lines is about 8 characters long that means your small file is about 130 thousand lines long and your large one about 134 million lines long so I expect on an average powered computer the above should take no more than a minute or 2 to run, it certainly won't take anything like an hour!

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • 1
    Thanks for your ansewer. Can you please me explain your code (e.g. the 1° one) line per line ? I'm not very practical with awk. Thanks in advance – Will Feb 23 '20 at 12:13
  • 1
    You're welcome. I added an explanation. You can always add print statements to see what the script is doing too. I'd be curious to know how long it did take to run on your real data if you'd care to share. – Ed Morton Feb 23 '20 at 14:44
  • 1
    Thanks, it takes about 15 minutes – Will Feb 23 '20 at 15:32
  • If I want to check if is equal also the 1° column I should add: NR==FNR { a[$1,$2,$3][$4]; next } ($1,$2,$3) in a { for (val in a[$1,$2,$3]) { print val, $4 } } it is correct ? – Will Feb 23 '20 at 16:23
0

An alternative to the solution of Ed Morton, but with an identical idea:

$ cat tst.awk
NR==FNR { a[$2,$3] = a[$2,$3] $4 ORS; next }
($2,$3) in a {
    s=a[$2,$3]; gsub(ORS,OFS $4 ORS,s) 
    printf "%s",s;
}

$ awk -f tst.awk small_file large_file
a w
b m
c m
kvantour
  • 25,269
  • 4
  • 47
  • 72