
I have two files. One contains a list of items, e.g.,

Allie
Bob
John
Laurie

Another file (file2) contains a different list of items in a different order, but some items might overlap with the items in file1, e.g.,

Laurie 45 56 6 75
Moxipen 10 45 56 56
Allie 45 56 67 23

I want to intersect these two files and extract only those lines from file2 whose first field matches an item in file1.

i.e., my output should be

Allie 45 56 67 23
Laurie 45 56 6 75

(preferably in this order, but it's OK if not)

grep -f file1 file2 doesn't do what I want.

I also need something efficient because the second file is HUGE.

I also tried this:

awk -F, 'FNR==NR {a[$1]=$0; next}; $1 in a {print a[$1]}' file2 file1
Bondrak
  • Possible duplicate of [Fastest way to find lines of a text file from another larger text file in Bash](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash) – Sundeep Dec 08 '17 at 15:50
  • Thanks for the pointer. Not exactly. – Bondrak Dec 08 '17 at 15:54
  • Do you need to preserve sort order? The `join` command can do this job, if you're willing to sort on the field you're joining on (which can be done inline w/ process substitutions with `sort`). – Charles Duffy Dec 08 '17 at 16:00
  • (and if your files are pre-sorted on the key field, `join` will be extremely efficient -- single-pass, no seeking, minimal memory requirements). – Charles Duffy Dec 08 '17 at 16:03
  • the `awk` code you tried would give the order you want, just remove `-F,` – Sundeep Dec 08 '17 at 16:06
  • ...that said, the awk code requires that file1 fit in memory. If your files are huge, that may not be ideal. (GNU `sort` can sort files larger than memory -- it sorts to temporary files and merges them together in that case -- and `join` doesn't store more than a few lines in memory at the same time in any event). – Charles Duffy Dec 08 '17 at 16:07
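
A minimal sketch of what those two comments suggest, assuming whitespace-separated fields: the attempted command with -F, removed. Note that, written this way, it reads file2 (the huge file) into memory, but it does print matches in file1's order.

# Sketch per the comments above: the attempted command minus -F,.
# First pass stores each line of file2 keyed on its first field (so file2 must fit in memory);
# second pass walks file1 and prints the stored match, giving file1's order.
awk 'FNR==NR {a[$1]=$0; next} $1 in a {print a[$1]}' file2 file1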

3 Answers


If order doesn't matter then

awk 'FNR==NR{ arr[$1]; next }$1 in arr' file1 file2

Explanation

  • FNR==NR{ arr[$1]; next } -- while reading the first file (file1), create the array arr, whose index keys are the first field $1 of each line, then skip to the next record.
  • $1 in arr -- while reading the second file (file2), check whether its first column is an index key of the array arr built from file1 ($1 in arr is true if the key exists); if so, awk's default action prints the current record/row/line from file2.

Test Results:

akshay@db-3325:/tmp$ cat file1
Allie
Bob
John
Laurie

akshay@db-3325:/tmp$ cat file2
Laurie 45 56 6 75
Moxipen 10 45 56 56
Allie 45 56 67 23

akshay@db-3325:/tmp$ awk 'FNR==NR{ arr[$1]; next }$1 in arr' file1 file2
Laurie 45 56 6 75
Allie 45 56 67 23
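
If you do want the output in file1's order while still keeping only file1's keys in memory (file2 being the huge one), a variant along these lines should work -- a sketch, not tested beyond the sample data:

# Read file1 first: remember both the keys and the order they appear in.
# While reading file2, buffer the last matching line per key, then emit the
# buffered lines in file1's order at the end.
awk '
  FNR==NR { order[++n] = $1; want[$1]; next }
  $1 in want { hit[$1] = $0 }
  END { for (i = 1; i <= n; i++) if (order[i] in hit) print hit[order[i]] }
' file1 file2
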
Akshay Hegde

No need for complex joins; this is just a filtering task:

$ grep -wFf file1 file2

Laurie 45 56 6 75
Allie 45 56 67 23

This has the benefit of keeping file2's order as well. The -w option is for full-word matches, which eliminates the false positives that sub-string matches would create. Of course, if your sample input is not representative and your data may contain key-like entries in other fields, this will not work without anchoring the pattern to the beginning of the line.
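
If keys could indeed show up in other fields, one way to qualify the beginning of the line (a sketch, assuming the names in file1 contain no regex metacharacters and that every matching line of file2 has at least one more field after the name) is to anchor each key and drop -F:

# Turn each key into an anchored pattern like "^Allie " so only field 1 can match;
# -w is no longer needed because the trailing space delimits the field.
grep -f <(sed 's/.*/^& /' file1) file2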

karakfa
  • I tried this one as well. It returns the contents of file2... but it does work on the toy data I provided. I guess the issue is the encoding (utf-8). I'll try with export LC_ALL=C and report. – Bondrak Dec 08 '17 at 17:33
  • It didn't help. – Bondrak Dec 08 '17 at 18:08
  • It means that the sample files are not representative of your actual files. Perhaps you have different file formats (dos2unix?). Manually inspect with `cat -A file2`. You can also try `grep -wF somekey file2` – karakfa Dec 08 '17 at 18:38
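
For anyone following that debugging exchange, the checks being suggested look roughly like this (somekey stands in for one name that should match):

cat -A file2 | head      # ^M before the line-end $ means DOS/Windows line endings
grep -wF somekey file2   # try a single literal key by hand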

This is the job that join is built for.

Providing a reproducer testable via copy-and-paste with shell functions (which you could replace with your actual input files):

cat_file1() {
  printf '%s\n' Allie Bob John Laurie
}

cat_file2() {
  printf '%s\n' 'Laurie 45 56 6 75' \
                'Moxipen 10 45 56 56' \
                'Allie 45 56 67 23'
}

join <(cat_file1 | sort) <(cat_file2 | sort)

...properly emits:

Allie 45 56 67 23
Laurie 45 56 6 75

Of course, don't cat file1 | sort -- run sort <file1 to provide a real handle for better efficiency, or (better!) store your inputs in sorted form in the first place.
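
For completeness, a sketch of the pre-sorted case -- assuming both files are already sorted on their first field, and forcing a single locale so sort and join agree on collation:

# Pre-sorted inputs: join streams both files in a single pass.
LC_ALL=C join file1 file2

# Not pre-sorted: sort on the fly (join itself still keeps only a few lines in memory).
LC_ALL=C join <(LC_ALL=C sort file1) <(LC_ALL=C sort -k1,1 file2)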

Charles Duffy
  • Your example ran fine. When I input the actual files, it returns 0 lines. Trying to figure out why. – Bondrak Dec 08 '17 at 16:24
  • Hmm. It won't return anything if the inputs aren't sorted on the key fields by the time they reach `join` -- that's the most obvious thing, but the process substitutions going through `sort` should resolve it. Let me know if you figure out how to reproduce the issue, or can provide more details I could use to try to puzzle out what's going on. – Charles Duffy Dec 08 '17 at 16:26
  • not sure. Both are sorted. Maybe char encoding. I'm dealing with utf-8. – Bondrak Dec 08 '17 at 17:13
  • If you `export LC_ALL=C` before running the pertinent commands, that's one fewer variable in play (it will ensure that `sort` is doing, and `join` expecting, a simple ASCII-value sort). Can you generate a smaller version of your real input files that reproduces the issue, and then try to work from there to get to a reproducer that can be shared/posted? I'd be happy to check out a [gist](https://gist.github.com/) with inputs that reproduce the issue and debug from there. – Charles Duffy Dec 08 '17 at 17:18