How to get the pivot lines from two tab-separated files?

Question

Given two files file1.txt

abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432

and file2.txt

foo bar \t hello world
abc def \t good morning
xyz \t 456

The task is to extract the lines where the first column matches and achieve:

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

I can do it in Python as such:

from io import StringIO

file1 = """abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432"""


file2 = """foo bar \t hello world
abc def \t good morning
xyz \t 456"""

map1, map2 = {}, {}

with StringIO(file1) as fin1:
    for line in file1.split('\n'):
        one, two = line.strip().split('\t')
        map1[one] = two
    
    
with StringIO(file2) as fin2:
    for line in file2.split('\n'):
        one, two = line.strip().split('\t')
        map2[one] = two
        
        
for k in set(map1).intersection(set(map2)):
    print('\t'.join([k, map1[k], map2[k]]))

The actual task files have billions of lines, are there faster solution without loading everything and keeping the hashmaps/dictionaries?

Maybe using unix/bash commands? Would pre-sorting the files help?

score 5 · Accepted Answer · answered Feb 09 '21 at 15:18

The join command is sometimes hard to work with, but here it's straightforward:

join -t $'\t' <(sort file1.txt) <(sort file2.txt)

That uses bash's ANSI-C quoting to specify the tab separator, and process substitutions to treat program output as a file.

To see the output, pipe the above into cat -A to see the tabs represented as ^I:

abc def^I123 456^Igood morning$
foo bar^I789 123^Ihello world$

score 2 · Answer 2 · answered Feb 09 '21 at 15:09

You may try this awk:

awk '{key = $1 FS $2} FNR==NR {sub(/^([^[:blank:]]+[[:blank:]]+){2}/, ""); map[key] = $0; next} key in map {print $0, map[key]}' file2.txt file1.txt

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

A more readable version:

awk '{
   key = $1 FS $2
}
FNR == NR {
   sub(/^([^[:blank:]]+[[:blank:]]+){2}/, "")
   map[key] = $0
   next
}
key in map {
   print $0, map[key]
}' file2.txt file1.txt

It only loads data from file2 into memory and processes records of file1 line by line.

aborruso · Answer 3 · 2021-02-12T11:17:49.870

1

Using Miller (https://github.com/johnkerl/miller) and its join verb

mlr --tsv --implicit-csv-header --headerless-csv-output join -j 1 --rp 2 -f file1.txt file2.txt >output.tsv

The output will be (it's only a preview, you will have tab separator):

| foo bar | 789 123 | hello world  |
| abc def | 123 456 | good morning |

edited Feb 12 '21 at 11:17

answered Feb 12 '21 at 06:58

aborruso

4,938
3
23
40

How to get the pivot lines from two tab-separated files?

3 Answers3

Linked