
Looking to perform an inner join on two different text files. Basically I'm looking for the inner join equivalent of the GNU join program. Does such a thing exist? If not, an awk or sed solution would be most helpful, but my first choice would be a Linux command.

Here's an example of what I'm looking to do:

file 1:

0|Alien Registration Card LUA|Checklist Update
1|Alien Registration Card LUA|Document App Plan
2|Alien Registration Card LUA|SA Application Nbr
3|Alien Registration Card LUA|tmp_preapp-DOB
0|App - CSCE Certificate LUA|Admit Type
1|App - CSCE Certificate LUA|Alias 1
2|App - CSCE Certificate LUA|Alias 2
3|App - CSCE Certificate LUA|Alias 3
4|App - CSCE Certificate LUA|Alias 4

file 2:

Alien Registration Card LUA

Results:

0|Alien Registration Card LUA|Checklist Update
1|Alien Registration Card LUA|Document App Plan
2|Alien Registration Card LUA|SA Application Nbr
3|Alien Registration Card LUA|tmp_preapp-DOB
Dave Snigier

5 Answers


Here's an awk option, so you can avoid the bash dependency (for portability):

$ awk -F'|' 'NR==FNR{check[$0];next} $2 in check' file2 file1

How does this work?

  • -F'|' -- sets the field separator to |.
  • NR==FNR{check[$0];next} -- if the total record number matches this file's record number (i.e. we're still reading the first file provided, file2), populate the check array using the whole line as a key, then skip to the next record.
  • $2 in check -- if the second field is a key in the array we built, print the line (the default action when no action block is given).
  • file2 file1 -- the files. Order is important due to the NR==FNR construct (see the demo run below).
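
For reference, here's a demo run against the sample files from the question; the output assumes exactly the data shown above. Note that unlike join, this approach needs no sorting and reads each file only once:

$ awk -F'|' 'NR==FNR{check[$0];next} $2 in check' file2 file1
0|Alien Registration Card LUA|Checklist Update
1|Alien Registration Card LUA|Document App Plan
2|Alien Registration Card LUA|SA Application Nbr
3|Alien Registration Card LUA|tmp_preapp-DOB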
ghoti
  • I had a scenario where I had a list of things, some rows repeated and others not, so in the end I had two files: one with only the repeated rows and the other with the unique ones. When I ran this awk it merged them both like a charm. Thanks @ghoti, it worked! – Ricardo Rivera Nieves Aug 19 '20 at 12:41

Should not file2 contain LUA at the end?

If yes, you can still use join:

join -t'|' -12 <(sort -t'|' -k2 file1) file2
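
Two notes: join needs both inputs sorted on the join field (file2 is a single line here, so it's trivially sorted, but sort it too if it grows), and join prints the join field first, so the layout differs from file1. GNU join's -o option can restore the original column order; a minimal sketch, assuming GNU coreutils and the sample data above:

$ join -t'|' -12 -o 1.1,1.2,1.3 <(sort -t'|' -k2 file1) <(sort file2)
0|Alien Registration Card LUA|Checklist Update
1|Alien Registration Card LUA|Document App Plan
2|Alien Registration Card LUA|SA Application Nbr
3|Alien Registration Card LUA|tmp_preapp-DOB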
choroba
  • Thank you! Yes it should (I've updated the question). I did not know that one can shell out and feed that result back in through stdin. Quite useful! – Dave Snigier Nov 07 '12 at 16:00
  • @DaveSnigier: You can always create a temp file, but Process Substitution is much shorter to type. – choroba Nov 07 '12 at 16:08
  • *Both* input files need to be sorted for `join` to work correctly. This is a major drawback especially if your files are big. The Awk solution will probably be more useful for many visitors. – tripleee May 05 '21 at 17:51

Looks like you just need

grep -F -f file2 file1
glenn jackman
  • The caveat is that this will find *substring* matches; i.e. if the file with the keys contains `dog` it will find lines in the other file which contain `endogenous` or `doggone`. You can fix this by switching from `grep -F` - which searches for static strings - to plain `grep` (or `grep -E` for a more versatile regex dialect) which searches by regular expressions, but then you have to modify the pattern file etc, at which point other answers on this page will probably be less complex and easier to understand. But if your keys are long or you know there are no accidental matches, this works. – tripleee May 05 '21 at 17:41
  • That's a good rescue in some situations but not in others: it will prevent a word from matching part of another word, but it will not be sufficient if your fields can contain multiple words (e.g. "full" will match "not full", even with `-w`). – tripleee May 05 '21 at 17:58
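
For this data, one way to keep grep -F yet avoid tripleee's substring caveat above is to wrap each key in the | delimiters before matching; a minimal sketch, assuming GNU grep (for -f - reading patterns from stdin) and that the join field is never first or last on a line (otherwise anchor with ^ or $ instead):

$ sed 's/.*/|&|/' file2 | grep -F -f - file1
0|Alien Registration Card LUA|Checklist Update
1|Alien Registration Card LUA|Document App Plan
2|Alien Registration Card LUA|SA Application Nbr
3|Alien Registration Card LUA|tmp_preapp-DOB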

You may modify this script:

cat file2 | while read line; do
    grep $line file1 # or whatever you want to do with the $line variable
done

The while loop reads file2 line by line and passes each line to grep, which searches for it in file1. Any extra output can be removed with grep options.

hcg
  • Very clever! I will be able to find many uses for this pattern beyond this immediate problem – Dave Snigier Nov 07 '12 at 15:58
  • Don't forget to quote `$line`. If it contains spaces, it could be expanded badly. Also, if you're looking for fixed string matching rather than regular expressions (remember, it's gREp), then use the `-F` option. – ghoti Nov 07 '12 at 16:09
  • This is just an inefficient paraphrase of [this answer](http://stackoverflow.com/a/13278763/874188) by @glennjackman. (Yes, his was posted later.) A single `grep` which reads in the patterns and then checks the target file just once is vastly more efficient than running one `grep` for each pattern, especially of course for large inputs. – tripleee Dec 16 '15 at 10:21
  • @ghoti's comment could be more strongly worded. Even if your input contains no spaces, things could go horribly wrong if the input ever contains any shell metacharacter. TL;DR It's *wrong* to omit the quotes here. – tripleee Mar 30 '17 at 13:53
  • You may want to make that something like `grep ^$line\$ file1` to avoid substring matches. – ketil Apr 28 '17 at 10:43
  • Congratulations! Your code snippet was found on a device manufactured by Qisda/BenQ. ES600/Avant3/OYO **;-)** – 0x2b3bfa0 Aug 07 '17 at 12:23
  • @ketil: No need; just use `grep -Fx` (`-F` = fixed-string (not regexp), `-x` = line-match). – wchargin Jul 16 '18 at 05:10
  • @ketil That's not a good regex for a *join*. To `grep` for the key with the separator on both sides, try `grep "|$line|"`; of course, change the first delimiter to `^` if you want to join on the first field on each line, or the last one to `$` for the final field, obviously. – tripleee May 05 '21 at 17:48
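
Pulling the comments together, a hedged rewrite of the loop: quote the variable, use -F for literal matching, wrap the key in the | delimiters, and read the file directly instead of piping from cat:

while IFS= read -r line; do        # -r: don't mangle backslashes
    grep -F "|$line|" file1        # -F: literal match; |...| keeps the key from matching substrings
done < file2

Even then, as noted above, a single grep -F -f with all the keys at once is far more efficient than one grep per line on large inputs.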

You can use the paste command to combine files:

paste [option] file1 file2 [> destination file]

For your example it would be:

paste file1.txt file2.txt >result.txt
  • Close, but paste will only join on the line numbers of two files. I'm actually looking to join on a field in the file instead. – Dave Snigier Nov 07 '12 at 16:03
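
To see why: paste pairs line N of one file with line N of the other, so with the sample data it merely glues file2's single line onto the first line of file1 (tab-separated by default) and leaves every other line unmatched; a quick sketch, assuming the files above:

$ paste file1 file2 | head -n 2
0|Alien Registration Card LUA|Checklist Update	Alien Registration Card LUA
1|Alien Registration Card LUA|Document App Plan	

(The second line ends in a bare tab because file2 has run out of lines.)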