168

I have a file f1:

line1
line2
line3
line4
..
..

I want to delete all the lines which are in another file f2:

line2
line8
..
..

I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?

Bernhard Barker
lalli
  • possible duplicate of [Remove Lines from File which appear in another File](http://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file) – Sven Hohenstein Mar 15 '13 at 11:15
  • If you are looking to remove lines from a file that "even contain" strings from another file (for instance partial matches) see http://unix.stackexchange.com/questions/145079/remove-all-lines-in-file-a-which-contain-the-strings-in-file-b – rogerdpack Oct 16 '15 at 17:30

11 Answers

212

grep -v -x -f f2 f1 should do the trick.

Explanation:

  • -v to select non-matching lines
  • -x to match whole lines only
  • -f f2 to get patterns from f2

One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns).
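
For example, with the f1 and f2 shown in the question (ignoring the .. placeholder lines), the combined flags keep every line of f1 that doesn't appear verbatim in f2:

$ grep -F -x -v -f f2 f1
line1
line3
line4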

Bernhard Barker
gabuzo
  • This has O(n²) complexity and will start to take hours to complete once the files contain more than a few K lines. – Arnaud Le Blanc Jan 24 '11 at 10:59
  • Figuring out which SO-suggested algorithms have O(n^2) complexity only has O(n) complexity, but can still take hours to complete. – Dave Jul 18 '12 at 13:45
  • I just tried this on 2 files of ~2k lines each, and it got killed by the OS (granted, this is a not-so-powerful VM, but still). – Trebor Rude Feb 18 '14 at 01:45
  • I love the elegance of this; I prefer the speed of Jona Christopher Sahnwal's answer. – Alex Hall Nov 08 '15 at 21:15
  • This method will remove the new-line characters between lines – sdgfsdh Mar 02 '16 at 10:23
  • @arnaud576875: Are you sure? It depends on the implementation of `grep`. If it preprocesses `f2` properly before it starts searching, the search will only take O(n) time. – HelloGoodbye Aug 09 '17 at 16:27
  • @user202729 Fixed. – Bernhard Barker Mar 12 '19 at 12:49
  • @AlexHall I cannot find Jona Christopher Sahnwal's answer... – Digger Sep 24 '20 at 16:21
  • @Digger maybe their username or display name changed to jcsahnwal? (At this writing, their display name includes the phrase "Reinstate Monica," which I'm guessing is in support of a fired moderator) I _may_ have been referring to this answer: https://stackoverflow.com/a/18477228/1397555 – Alex Hall Sep 26 '20 at 22:34
  • @AlexHall I think you've got it! There is a decent chance that the "Jona Christopher Sahnwal" username you refer to above is now going by "nowjcsahnwaldt Reinstate Monica". – Digger Sep 27 '20 at 15:56
  • Used this with files of 2M and 4M lines; it completed quickly and fine. Note the lines were just 8 chars long. – Fuseteam Nov 17 '20 at 17:12
  • @gabuzo and @Bernhard Barker Thanks for the answer. Say `f1` and `f2` contain lines with strings such as `[` (e.g. `f1` is `line 1\n[428` and `f2` is `line 2\n[428`); then `grep -v -x -f f1 f2` would give the wrong result (i.e. `line 2\n[428`). I would suggest making it general by adding `-F` before `-v`, i.e. `grep -F -v -x -f f1 f2`, which would give the desired `line 2` result. – DavidC. Apr 13 '21 at 23:05
80

Try comm instead (assuming f1 and f2 are already sorted):

comm -2 -3 f1 f2
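
If the files aren't sorted yet, process substitution can sort them on the fly (as davemyron also points out in the comments below):

comm -2 -3 <(sort f1) <(sort f2)
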
rogerdpack
Ignacio Vazquez-Abrams
  • I'm not sure `comm` is the solution, as the question does not indicate that the lines in `f1` are sorted, which is a prerequisite to using `comm` – gabuzo Jan 24 '11 at 09:54
  • This worked for me, as my files were sorted and had 250,000+ lines in one of them, only 28,000 in the other. Thanks! – Winter May 26 '14 at 22:22
  • When this works (input files are sorted), this is extremely fast! – Mike Jarvis Sep 04 '15 at 22:09
  • As in arnaud576875's solution, for me (using cygwin) this eliminated duplicate lines in the second file, which one may want to keep. – Alex Hall Nov 08 '15 at 21:04
  • You can use process substitution to sort the files first, of course: `comm -2 -3 <(sort f1) <(sort f2)` – davemyron Mar 25 '16 at 16:01
  • Another nice thing about this solution is you can change the `-123` args to get different lists between the two. As long as you combine only two of the args you'll always get a single list ... it's like working with sets in Python or boolean operators ... awesomeness – Neil C. Obremski Apr 20 '22 at 04:16
17

If the exclude file isn't too huge, you can use AWK's associative arrays.

awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt 

The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.

The algorithmic complexity will probably be O(n + m), where n is the size of exclude-these.txt and m is the size of from-this.txt.
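
If you don't need the case-insensitivity, a sketch of a case-sensitive variant is simply the same one-liner without the tolower() calls:

awk 'NR == FNR { list[$0]=1; next } { if (! list[$0]) print }' exclude-these.txt from-this.txt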

Dennis Williamson
  • Why do you say files that aren't too huge? The fear here is (I assume) awk running the system out of system memory to create the hash, or is there some other limitation? – rogerdpack Oct 16 '15 at 16:58
  • for followers, there are even other more aggressive option to "sanitize" the lines (since the comparison has to be exact to use the associative array), ex http://unix.stackexchange.com/a/145132/8337 – rogerdpack Oct 16 '15 at 17:36
  • @rogerdpack: A large exclude file will require a large hash array (and a long processing time). A large "from-this.txt" will only require a long processing time. – Dennis Williamson Oct 16 '15 at 17:51
  • This fails (i.e. doesn't produce any output) if `exclude-these.txt` is empty. @jona-christopher-sahnwaldt 's answer below works in this case. You can also specify multiple files e.g. `awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 done.out failed.out f=2 all-files.out` – Graham Russell Jun 24 '17 at 09:55
  • @GrahamRussell I cannot find Jona Christopher Sahnwal's answer... – Digger Sep 24 '20 at 16:23
  • @Digger: [Here](https://stackoverflow.com/a/18477228/26428) – Dennis Williamson Jun 28 '22 at 14:03
13

Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):

awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt

Accessing r[$0] creates the entry for that line; there's no need to set a value.

Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.
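
A nice side effect of setting f explicitly is that several exclude files can be chained before the main file (Graham Russell shows this in the comments below); a sketch with hypothetical file names:

awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } }' f=1 exclude-a.txt exclude-b.txt f=2 from-this.txt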

  • How does this differ from the Dennis Williamson answer? Is the only difference that it doesn't do an assignment into the hash, so slightly faster than this? Algorithmic complexity is the same as his? – rogerdpack Oct 16 '15 at 16:48
  • The difference is mostly syntactic. I find the variable `f` clearer than `NR == FNR`, but that's a matter of taste. Assignment into the hash should be so fast that there's no measurable speed difference between the two versions. I think I was wrong about complexity - if lookup is constant, update should be constant as well (on average). I don't know why I thought update would be logarithmic. I'll edit my answer. – jcsahnwaldt Reinstate Monica Oct 17 '15 at 11:05
  • I tried a bunch of these answers, and this one was AMAZEBALLS fast. I had files with hundreds of thousands of lines. Worked like a charm! – Mr. T Apr 12 '17 at 04:53
  • This is my preferred solution. It works with multiple files and also empty exclude files e.g. `awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 empty.file done.out failed.out f=2 all-files.out`. Whereas the other `awk` solution fails with empty exclude file and can only take one. – Graham Russell Jun 30 '17 at 21:09
  • I found that the memory usage needed is roughly **3x** of the `exclude-these.txt` file. For example, my exclude file is 6 GB (in 90M lines), and `awk` seems to need 19 GB resident memory for the associative array. – nh2 Aug 08 '22 at 18:03
  • Regarding throughput: After the associative array is built, `awk` outputs lines at 50 MB/s on an AMD Ryzen 7 3700X backed by a Samsung MZQL2960HCJR NVMe SSD -- not great but not bad. – nh2 Aug 08 '22 at 18:09
5

If you have Ruby (1.9+):

#!/usr/bin/env ruby 
b=File.read("file2").split
open("file1").each do |x|
  x.chomp!
  puts x if !b.include?(x)
end

This has O(N²) complexity. If you care about performance, here's another version:

b=File.read("file2").split
a=File.read("file1").split
(a-b).each {|x| puts x}

which uses a hash to perform the subtraction, so the complexity is O(n + m), where n and m are the sizes of a and b.

Here's a little benchmark of the above with 100K lines, courtesy of user576875:

$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test

real    0m0.639s
user    0m0.554s
sys     0m0.021s

$ time sort file1 file2 | uniq -u > sort.test

real    0m2.311s
user    0m1.959s
sys     0m0.040s

$ diff <(sort -n ruby.test) <(sort -n sort.test)
$

diff was used to show there are no differences between the two generated files.

rogerdpack
kurumi
4

Some timing comparisons between various other answers:

$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null

real    0m0.019s
user    0m0.023s
sys     0m0.012s
$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null

real    0m0.026s
user    0m0.018s
sys     0m0.007s
$ time grep -xvf f2 f1 > /dev/null

real    0m43.197s
user    0m43.155s
sys     0m0.040s

sort f1 f2 | uniq -u isn't even a symmetric difference, because it removes lines that appear multiple times in either file.
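
A quick demonstration of that flaw: here the line a is unique to f1, yet it vanishes from the output merely because it occurs twice in f1.

$ printf 'a\na\nb\n' > f1
$ printf 'c\n' > f2
$ sort f1 f2 | uniq -u
b
c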

comm can also be used with stdin and here strings:

echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a
rogerdpack
Lri
3

Seems to be a job suitable for the SQLite shell:

create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify ".separator ××any_improbable_string××"
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
select * from file2 where line not in (select line from file1);
.q
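
Assuming the statements above are saved in a script (hypothetically diff.sql), the whole thing can be run non-interactively; with no database argument, sqlite3 works in a transient in-memory database and result.txt ends up in the current directory:

sqlite3 < diff.sql
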
Benoit
1

Did you try this with sed?

sed 's#^#sed -i '"'"'s%#g' f2 > f2.sh

sed -i 's#$#%%g'"'"' f1#g' f2.sh

sed -i '1i#!/bin/bash' f2.sh

sh f2.sh
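
To see what these four commands build: each line of f2 is wrapped into its own sed substitution, so for the f2 from the question the generated f2.sh should look roughly like the script below. Note that s%...%%g blanks out every occurrence of the text anywhere in f1 (including partial matches inside longer lines) rather than deleting whole lines:

#!/bin/bash
sed -i 's%line2%%g' f1
sed -i 's%line8%%g' f1
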
Ruan
1
$ cat values.txt
apple
banana
car
taxi

$ cat source.txt
fruits
mango
king
queen
number
23
43
sentence is long
so what
...
...

I made a small shell script to "weed" out of the source file the values that are present in the values.txt file.

$ cat weed_out.sh
from=$1
cp -p $from $from.final
for x in `cat values.txt`;
do
 grep -v $x $from.final > $from.final.tmp
 mv $from.final.tmp $from.final
done

executing...

$ ./weed_out source.txt

and you get a nicely cleaned-up file.
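
Note that the unquoted $x and the plain grep -v make this a substring match that can also break on spaces in values.txt. A slightly more robust sketch of the same loop, borrowing the -x and -F flags from the top answer:

from=$1
cp -p "$from" "$from.final"
while IFS= read -r x; do
  grep -v -x -F -- "$x" "$from.final" > "$from.final.tmp"
  mv "$from.final.tmp" "$from.final"
done < values.txt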

rajeev
0

Not a 'programming' answer but here's a quick and dirty solution: just go to http://www.listdiff.com/compare-2-lists-difference-tool.

Obviously won't work for huge files but it did the trick for me. A few notes:

  • I'm not affiliated with the website in any way (if you still don't believe me, then you can just search for a different tool online; I used the search term "set difference list online")
  • The linked website seems to make network calls on every list comparison, so don't feed it any sensitive data
youngrrrr
0

A Python way of filtering one list using another list.

Load files:

>>> f1 = open('f1').readlines()
>>> f2 = open('f2').readlines()

Remove '\n' string at the end of each line:

>>> f1 = [i.replace('\n', '') for i in f1]
>>> f2 = [i.replace('\n', '') for i in f2]

Print only the f1 lines that don't contain any line from f2 (note this is a substring match, not a whole-line match):

>>> [a for a in f1 if all(b not in a for b in f2)]
KSVelArc