5

Given two files file1.txt

abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432 

and file2.txt

foo bar \t hello world
abc def \t good morning
xyz \t 456

The task is to extract the lines where the first column matches and achieve:

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

I can do it in Python as such:

from io import StringIO

file1 = """abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432"""


file2 = """foo bar \t hello world
abc def \t good morning
xyz \t 456"""

map1, map2 = {}, {}

with StringIO(file1) as fin1:
    for line in file1.split('\n'):
        one, two = line.strip().split('\t')
        map1[one] = two
    
    
with StringIO(file2) as fin2:
    for line in file2.split('\n'):
        one, two = line.strip().split('\t')
        map2[one] = two
        
        
for k in set(map1).intersection(set(map2)):
    print('\t'.join([k, map1[k], map2[k]]))

The actual task files have billions of lines, are there faster solution without loading everything and keeping the hashmaps/dictionaries?

Maybe using unix/bash commands? Would pre-sorting the files help?

oguz ismail
  • 1
  • 16
  • 47
  • 69
alvas
  • 115,346
  • 109
  • 446
  • 738

3 Answers3

5

The join command is sometimes hard to work with, but here it's straightforward:

join -t $'\t' <(sort file1.txt) <(sort file2.txt)

That uses bash's ANSI-C quoting to specify the tab separator, and process substitutions to treat program output as a file.

To see the output, pipe the above into cat -A to see the tabs represented as ^I:

abc def^I123 456^Igood morning$
foo bar^I789 123^Ihello world$
glenn jackman
  • 238,783
  • 38
  • 220
  • 352
2

You may try this awk:

awk '{key = $1 FS $2} FNR==NR {sub(/^([^[:blank:]]+[[:blank:]]+){2}/, ""); map[key] = $0; next} key in map {print $0, map[key]}' file2.txt file1.txt

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

A more readable version:

awk '{
   key = $1 FS $2
}
FNR == NR {
   sub(/^([^[:blank:]]+[[:blank:]]+){2}/, "")
   map[key] = $0
   next
}
key in map {
   print $0, map[key]
}' file2.txt file1.txt

It only loads data from file2 into memory and processes records of file1 line by line.

anubhava
  • 761,203
  • 64
  • 569
  • 643
1

Using Miller (https://github.com/johnkerl/miller) and its join verb

mlr --tsv --implicit-csv-header --headerless-csv-output join -j 1 --rp 2 -f file1.txt file2.txt >output.tsv

The output will be (it's only a preview, you will have tab separator):

| foo bar | 789 123 | hello world  |
| abc def | 123 456 | good morning |
aborruso
  • 4,938
  • 3
  • 23
  • 40