How to remove common lines between two files without sorting?

Question

I have two unsorted files which have some lines in common.

file1.txt

Z
B
A
H
L

file2.txt

S
L
W
Q
A

This is how I currently remove the common lines:

sort -u file1.txt > file1_sorted.txt
sort -u file2.txt > file2_sorted.txt

comm -23 file1_sorted.txt file2_sorted.txt > file_final.txt

Output:

B
H
Z

The problem is that I want to keep the order of file1.txt, I mean:

Desired output:

Z
B
H

One solution I thought of is to loop over all the lines of file2.txt and run:

sed -i "/^${line_file2}\$/d" file1.txt

But if the files are big, the performance may suck.

Do you like my idea?
Do you have any alternative to do it?

score 31 · Answer 1 · answered Jun 20 '14 at 09:47

You can just use grep (`-v` to invert the match, `-f` to read the patterns from a file). This prints the lines of input1 that do not match any line in input2:

grep -vf input2 input1

Gives:

Z
B
H

answered Jun 20 '14 at 09:47

perreal


Would it be better with the options `-F` and `-w` or `-x`, e.g. to handle the substring case? – Kent Jun 20 '14 at 09:54
This works to compare whether *entire lines* are equal: `grep -vxf input2 input1`. Also, grep on macOS (grep version 2.5.1) behaves oddly and doesn't give any results, so I had to use the grep from Homebrew, which is GNU grep. – Motsel Feb 18 '19 at 14:04
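
Following up on the two comments above, here is a minimal sketch that combines `-F` and `-x`, so that each line of the second file is treated as a literal string that must equal a whole line of the first file. Without them, a pattern such as `A` would also remove a line like `AB`, and characters such as `.` or `*` would be interpreted as regular expression syntax. The scratch directory and the `workdir` and `result` variable names are assumptions made for illustration; the data is the sample from the question.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' Z B A H L > "$workdir/file1.txt"
printf '%s\n' S L W Q A > "$workdir/file2.txt"

# -F: treat each pattern as a fixed string, not a regular expression
# -x: a pattern must match the whole line, not just part of it
# -v: print the lines that do NOT match
# -f: read the patterns from the given file
result=$(grep -vxFf "$workdir/file2.txt" "$workdir/file1.txt")
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$workdir"
```

The order of file1.txt is kept because grep only filters its input and never reorders it.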

score 27 · Accepted Answer · answered Jun 20 '14 at 09:49

grep or awk:

awk 'NR==FNR{a[$0]=1;next}!a[$0]' file2 file1

answered Jun 20 '14 at 09:49

Kent


`a[$0]=7` Why equal to seven? Thanks! :) – mllamazares Jun 20 '14 at 09:50
@JohnDoe we just need a non-zero number; 7 and 1 make no difference. I changed it to 1, if that makes you feel more comfortable. :-) – Kent Jun 20 '14 at 09:51
Yeah, I feel much better now. :) – mllamazares Jun 20 '14 at 09:57
Actually this is the best method to do it. It is faster than the grep method. Thank you! :D – mllamazares Sep 11 '14 at 15:18
@JohnDoe it should be faster than grep, because an array in awk is a hash table, so checking a key is a hash lookup, which is `O(1)`, whereas the grep one needs `O(n^2)`. However, my awk line keeps file2 in memory, so its space complexity is larger than that of the grep line. – Kent Sep 11 '14 at 15:31
This was a lifesaver. awk rocks! – A. K. Oct 31 '18 at 16:54
If `file2` is empty (i.e. `rm -f file2; touch file2`), this awk command won't work. – KaiserKatze Jun 17 '19 at 06:47
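
The empty-file case from the last comment arises because `NR==FNR` stays true for the second file when the first one contributes no records, so every line is swallowed by the first rule. One possible workaround, sketched here under that assumption, is to compare `FILENAME` with `ARGV[1]` instead and to test membership with `in`. The `remove_common` function, the `seen` array and the scratch files are hypothetical names used only for this illustration.

```shell
# Recreate the sample files from the question, plus an empty file.
workdir=$(mktemp -d)
printf '%s\n' Z B A H L > "$workdir/file1.txt"
printf '%s\n' S L W Q A > "$workdir/file2.txt"
: > "$workdir/empty.txt"

# Hypothetical helper: print the lines of $2 that do not occur in $1,
# keeping the order of $2.
remove_common() {
    awk '
        # While reading the first file argument, remember each line as a key.
        FILENAME == ARGV[1] { seen[$0]; next }
        # For the second file, print only the lines that were not remembered.
        !($0 in seen)
    ' "$1" "$2"
}

result=$(remove_common "$workdir/file2.txt" "$workdir/file1.txt")
result_empty=$(remove_common "$workdir/empty.txt" "$workdir/file1.txt")
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$workdir"
```

With the empty file as the first argument, all five lines of file1.txt are printed unchanged, since nothing was stored in `seen`.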

score 4 · Answer 3 · answered Jun 20 '14 at 11:28

I've written a little Perl script that I use for this kind of thing. It can do more than what you ask for but it can also do what you need:

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Std;
my %opts;
getopts('hvfcmdk:', \%opts);
my $missing=$opts{m}||undef;
my $column=$opts{k}||undef;
my $common=$opts{c}||undef;
my $verbose=$opts{v}||undef;
my $fast=$opts{f}||undef;
my $dupes=$opts{d}||undef;
$missing=1 unless $common || $dupes;
&usage() unless $ARGV[1];
&usage() if $opts{h};
my (%found,%k,%fields);
if ($column) {
    die("The -k option only works in fast (-f) mode\n") unless $fast;
    $column--; ## So I don't need to count from 0
}

open(my $F1, '<', $ARGV[0])||die("Cannot open $ARGV[0]: $!\n");
while(<$F1>){
    chomp;
    if ($fast){
        my @aa=split(/\s+/,$_);
        $k{$aa[0]}++;
        $found{$aa[0]}++;
    }
    else {
        $k{$_}++;
        $found{$_}++;
    }
}
close($F1);
my $n=0;
open(F2,"$ARGV[1]")||die("Cannot open $ARGV[1]: $!\n");
my $size=0;
if($verbose){
    while(<F2>){
        $size++;
    }
}
close(F2);
open(F2,"$ARGV[1]")||die("Cannot open $ARGV[1]: $!\n");

while(<F2>){
    next if /^\s+$/;
    $n++;
    chomp;
    print STDERR "." if $verbose && $n % 10==0;
    print STDERR "[$n of $size lines]\n" if $verbose && $n % 800==0;
    if($fast){
        my @aa=split(/\s+/,$_);
        $k{$aa[0]}++ if defined($k{$aa[0]});
        $fields{$aa[0]}=\@aa if $column;
    }
    else{
        foreach my $key(keys(%found)){
            if (/\Q$key/){
                $k{$key}++;
                $found{$key}=undef unless $dupes;
            }
        }
    }
}
close(F2);
print STDERR "[$n of $size lines]\n" if $verbose;

if ($column) {
    $missing && do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" unless $k{$_}>1}keys(%k);
    $common &&  do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" if $k{$_}>1}keys(%k);
    $dupes &&   do map{my @aa=@{$fields{$_}}; print "$aa[$column]\n" if $k{$_}>2}keys(%k);
}
else {
    $missing && do map{print "$_\n" unless $k{$_}>1}keys(%k);
    $common &&  do map{print "$_\n" if $k{$_}>1}keys(%k);
    $dupes &&   do map{print "$_\n" if $k{$_}>2}keys(%k);
}
sub usage{
    print STDERR <<EndOfHelp;

  USAGE: compare_lists.pl FILE1 FILE2

      This script will compare FILE1 and FILE2, searching for the
      contents of FILE1 in FILE2 (and NOT vice versa). FILE1 must
      contain one search pattern per line; each search pattern need
      only be contained within one of the lines of FILE2.

    OPTIONS:
      -c : Print patterns COMMON to both files
      -f : Fast mode: compare only the first whitespace-delimited
           word of each line of FILE1 and FILE2
      -d : Print duplicate entries
      -k : Column of FILE2 to print instead of the pattern (requires -f)
      -m : Print patterns MISSING in FILE2 (default)
      -v : Print progress information to STDERR
      -h : Print this help and exit
EndOfHelp
      exit(0);
}

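In your case, you would run it as

list_compare.pl -mf file1.txt file2.txt

The -m option (the default) prints the lines of file1.txt that are missing from file2.txt, whereas -c would print the common ones. The -f option makes it compare only the first word (defined by whitespace) of each line and greatly speeds things up. To compare the entire line, remove the -f. Note that the results are printed in Perl's hash order, so the original order of file1.txt is not guaranteed.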
