
I have two files. The first file has three hundred thousand records (example shown below as File1) and the second file has one hundred thousand records (File2). I am basically grepping each entry of File2 against File1 and retrieving every matching line from File1. I am doing this with a normal for loop:

for i in `cat file2.txt`; do cat file1 | grep -i -w $i; done > /var/tmp/file3.txt

As the data is huge, this operation takes me 8+ hours to complete. I need your expertise on a more efficient way to do this so it finishes in less than 2-3 hours.

Example entries

File1

server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |  
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |  
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |  
server1:user4:x:14598:24:User Four:/users/user4:/bin/bash |  
server1:user5:x:14598:24:User Five:/users/user5:/bin/bash | 

File2

user1  
user2  
user3  
BBJinu
  • When I had a similar problem, I had to recompile grep so the buffer used by the `-Ff` options would take in the whole search-target file. It's possible that the GNU `-F` option autosizes its memory consumption. But use `man grep` and read about `-F`. Also see if there is a LIMITATIONS section. Else you can build a similar tool with `awk` (assuming enough free memory to hold all of `file2`). Search here for similar Qs that have been posted. Good luck. – shellter Feb 19 '17 at 21:52
  • Hi shellter, thanks for the reply. This is what's in the man page; are you suggesting I do a grep -Ff and see if that helps? -F Matches using fixed strings. Treats each pattern specified as a string instead of a regular expression. If an input line contains any of the patterns as a contiguous sequence of bytes, the line is matched. A null string matches every line. See fgrep(1) for more information. – BBJinu Feb 19 '17 at 21:57
  • This can be considered a duplicate of so many similar posts on here. For example, see this post: http://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash/42239352#42239352 – George Vasiliou Feb 19 '17 at 22:09
  • Thanks George, you are right, I can take a few things from the post you shared, which will help me get what I am looking for. Thank you so much, I will mark this question as complete. – BBJinu Feb 19 '17 at 22:19
  • In the meantime, cat + grep is a terrible choice. You can grep a file directly; you don't need a cat first. You can also automatically feed grep with patterns, and you don't even need a loop over file2. That said, you can substitute your whole code with a simple command: `grep -f file2 file1`, or even better `grep -F -f file2 file1` (sketched after these comments). Don't be afraid to use the -F. – George Vasiliou Feb 19 '17 at 22:21
  • Awesome, I will try that. Thanks so much, George. – BBJinu Feb 19 '17 at 22:23
  • Let us know about your time test results.... – George Vasiliou Feb 19 '17 at 22:42
  • For a quicker solution, maybe less than a couple of minutes, see my Perl solution for this purpose: http://stackoverflow.com/a/42302653/6920976 . If you choose to use that script, just run ./comp.pl "file2.txt" "file1.txt" (for your case, as there it must be file1 and then file2). Thanks. – User9102d82 Feb 19 '17 at 23:01
  • Looks like -f and -F are not working for me: `root@server1 # grep -i -F -f file2 file1 > file3` gives `grep: illegal option -- F`, `grep: illegal option -- f`, `Usage: grep -hblcnsviw pattern file . . .`, and `grep -i -fF file2 file1 > file3` fails the same way. – BBJinu Feb 19 '17 at 23:02
  • Try `grep -Ff file2 file1`. Leave -i out of the game for testing. – George Vasiliou Feb 19 '17 at 23:12
  • That too didn't work. I am using Solaris 5.10, where it's not working, but I tried a Red Hat server and there the command (grep -i -F -f file2 file1 > file3) does seem to execute; waiting for it to complete. I may need to get this working on the Solaris server because the Red Hat server is just a test server. In the meantime I will go through the Perl script posted by "Useless Person". – BBJinu Feb 19 '17 at 23:21
  • OK. For future comments, start your comment with @user so that users are notified. – George Vasiliou Feb 19 '17 at 23:28
  • As your man page told you, "see `fgrep(1)`". So `fgrep -f file2 file1`. But given the system you have, `fgrep` probably won't be able to read all of `file2` into memory. If it doesn't work, then use one of the other solutions above. Also, given your (seemingly non-Linux) OS, you'll do well to include the output of `uname -srv` in all of your questions going forward. Good luck. – shellter Feb 19 '17 at 23:46
  • Possible duplicate of [grep a large list against a large file](http://stackoverflow.com/questions/19380925/grep-a-large-list-against-a-large-file) – codeforester Mar 10 '17 at 07:13
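
Pulling the comment suggestions together, here is a minimal sketch of the single-pass alternatives referenced above. It assumes GNU grep for the -i/-w/-F/-f combination (stock Solaris /usr/bin/grep rejects those options, as shown above; /usr/xpg4/bin/grep or fgrep may accept some of them) and nawk for shellter's awk idea. Note the fgrep and awk variants match substrings case-insensitively rather than whole words, so they approximate, not reproduce, the original -w behaviour.

# One pass over file1 instead of 100,000 grep invocations (GNU grep assumed):
grep -iwFf file2.txt file1.txt > /var/tmp/file3.txt

# fgrep fallback for systems whose grep rejects -F/-f (no -w here):
fgrep -if file2.txt file1.txt > /var/tmp/file3.txt

# awk sketch of the same idea: hold all of file2 in memory, scan file1 once.
nawk 'NR == FNR { pat[tolower($0)]; next }            # first file: store names
      { low = tolower($0)
        for (p in pat) if (p != "" && index(low, p)) { print; next }
      }' file2.txt file1.txt > /var/tmp/file3.txt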

1 Answer


Give this a shot.

Test Data:

%_Host@User> head file1.txt file2.txt
==> file1.txt <==
server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
server1:user4:x:14598:24:User Four:/users/user4:/bin/bash |
server1:user5:x:14598:24:User Five:/users/user5:/bin/bash |

==> file2.txt <==
user1
user2
user3
#user4
%_Host@User>

Output:

%_Host@User> ./2comp.pl file1.txt file2.txt   ; cat output_comp
server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
%_Host@User>
%_Host@User>

Script: Please give this one more try. Re-check the file order: file1 first and then file2, i.e. ./2comp.pl file1.txt file2.txt.

%_Host@User> cat 2comp.pl
#!/usr/bin/perl

use strict ;
use warnings ;

# @ARGV is (file1.txt, file2.txt): the record file first, then the name list.
my ($records,$names,$output) = (@ARGV,"output_comp") ;
my (%hash,%tmp) ;

die "Need 2 files!\n" if scalar @ARGV != 2 ;

# Read both files into a hash of unique lines, keyed by file name.
for (@ARGV) {
  open FH, '<', $_ or die "Cannot open $_\n" ;
  while (my $line = <FH>){
    $line =~ s/^.+[()].+$| +?$//g ;   # blank lines containing parentheses, strip trailing spaces
    chomp $line ;
    $hash{$_}{$line} = "$line" ;
  }
  close FH ;
}

open FH, '>>', $output or die "Cannot open outfile!\n" ;
foreach my $k1 (keys %{$hash{$names}}){             # each name from file2
  foreach my $k2 (keys %{$hash{$records}}){         # each record line from file1
    if ($k2 =~ m/^.+?\Q$k1\E.+?$/i){                # Case Insensitive matching, metacharacters quoted.
      if (!defined $tmp{"$hash{$records}{$k2}"}){   # print each matching record only once
        print FH "$hash{$records}{$k2}\n" ;
        $tmp{"$hash{$records}{$k2}"} = 1 ;
      }
    }
  }
}
close FH ;
# End.
%_Host@User>

Thanks, good luck.
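
For reference, a hypothetical timing run, assuming the script above is saved as 2comp.pl and marked executable. Note that it opens output_comp in append mode (>>), so clear any previous output first:

rm -f output_comp                      # the script appends, so remove old results
time ./2comp.pl file1.txt file2.txt
wc -l output_comp                      # sanity-check the number of matched records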

User9102d82
  • Dear friend, you did the job, that worked. Awesome. The only thing I checked again was the case-sensitivity scenario: I did a small test changing an entry in file2, making user3 into USER3, and that didn't work. Would that be a simple change? Can you consider that too? Because file2 may have entries like that. – BBJinu Feb 20 '17 at 05:44
  • Please see my updated comment in the script. It should do it! – User9102d82 Feb 20 '17 at 05:47
  • Thanks so much, that did the work. I will now try to run this on my three hundred thousand records and see how long it's going to take. Thanks again, friend. – BBJinu Feb 20 '17 at 05:52