168

I have a file f1:

line1
line2
line3
line4
..
..

I want to delete all the lines which are in another file f2:

line2
line8
..
..

I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?

Bernhard Barker
lalli
  • possible duplicate of [Remove Lines from File which appear in another File](http://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file) – Sven Hohenstein Mar 15 '13 at 11:15
  • If you are looking to remove lines from a file that "even contain" strings from another file (for instance partial matches) see http://unix.stackexchange.com/questions/145079/remove-all-lines-in-file-a-which-contain-the-strings-in-file-b – rogerdpack Oct 16 '15 at 17:30

11 Answers

212

grep -v -x -f f2 f1 should do the trick.

Explanation:

  • -v to select non-matching lines
  • -x to match whole lines only
  • -f f2 to get patterns from f2

One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns).
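
For example, with the f1 and f2 shown in the question (ignoring the .. placeholder lines), the combined flags keep every line of f1 that doesn't appear verbatim in f2:

$ grep -F -x -v -f f2 f1
line1
line3
line4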

Bernhard Barker
gabuzo
  • This has O(n²) complexity and will start to take hours to complete once the files contain more than a few K lines. – Arnaud Le Blanc Jan 24 '11 at 10:59
  • Figuring out which SO-suggested algorithms have O(n^2) complexity only has O(n) complexity, but can still take hours to complete. – Dave Jul 18 '12 at 13:45
  • I just tried this on 2 files of ~2k lines each, and it got killed by the OS (granted, this is a not-so-powerful VM, but still). – Trebor Rude Feb 18 '14 at 01:45
  • I love the elegance of this; I prefer the speed of Jona Christopher Sahnwal's answer. – Alex Hall Nov 08 '15 at 21:15
  • This method will remove the new-line characters between lines – sdgfsdh Mar 02 '16 at 10:23
  • @arnaud576875: Are you sure? It depends on the implementation of `grep`. If it preprocesses `f2` properly before it starts searching, the search will only take O(n) time. – HelloGoodbye Aug 09 '17 at 16:27
  • @user202729 Fixed. – Bernhard Barker Mar 12 '19 at 12:49
  • @AlexHall I cannot find Jona Christopher Sahnwal's answer... – Digger Sep 24 '20 at 16:21
  • @Digger maybe their username or display name changed to jcsahnwal? (At this writing, their display name includes the phrase "Reinstate Monica," which I'm guessing is in support of a fired moderator) I _may_ have been referring to this answer: https://stackoverflow.com/a/18477228/1397555 – Alex Hall Sep 26 '20 at 22:34
  • @AlexHall I think you've got it! There is a decent chance that the "Jona Christopher Sahnwal" username you refer to above is now going by "nowjcsahnwaldt Reinstate Monica". – Digger Sep 27 '20 at 15:56
  • Used this with files of 2M and 4M lines; it completed quickly and fine. Note the lines were just 8 chars long. – Fuseteam Nov 17 '20 at 17:12
  • @gabuzo and @Bernhard Barker Thanks for the answer. Say `f1` and `f2` contain lines with strings such as `[` (e.g. `f1` is `line 1\n[428` and `f2` is `line 2\n[428`); then `grep -v -x -f f1 f2` would give the wrong result (i.e. `line 2\n[428`). I would suggest making it general by adding `-F` before `-v`, i.e. `grep -F -v -x -f f1 f2`, which would give the desired `line 2` result. – DavidC. Apr 13 '21 at 23:05
80

Try comm instead (assuming f1 and f2 are already sorted):

comm -2 -3 f1 f2
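
If the files aren't sorted yet, process substitution can sort them on the fly (as davemyron also points out in the comments below):

comm -2 -3 <(sort f1) <(sort f2)
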
rogerdpack
Ignacio Vazquez-Abrams
  • I'm not sure `comm` is the solution, as the question does not indicate that the lines in `f1` are sorted, which is a prerequisite to using `comm` – gabuzo Jan 24 '11 at 09:54
  • This worked for me, as my files were sorted and had 250,000+ lines in one of them, only 28,000 in the other. Thanks! – Winter May 26 '14 at 22:22
  • When this works (input files are sorted), this is extremely fast! – Mike Jarvis Sep 04 '15 at 22:09
  • As in arnaud576875's solution, for me (using cygwin) this eliminated duplicate lines in the second file, which one may want to keep. – Alex Hall Nov 08 '15 at 21:04
  • You can use process substitution to sort the files first, of course: `comm -2 -3 <(sort f1) <(sort f2)` – davemyron Mar 25 '16 at 16:01
  • Another nice thing about this solution is you can change the `-123` args to get different lists between the two. As long as you combine only two of the args you'll always get a single list ... it's like working with sets in Python or boolean operators ... awesomeness – Neil C. Obremski Apr 20 '22 at 04:16
17

If the exclude file isn't too huge, you can use AWK's associative arrays.

awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt 

The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.

The algorithmic complexity will probably be O(n + m), where n is the size of exclude-these.txt and m is the size of from-this.txt.
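
If you don't need the case-insensitivity, a sketch of a case-sensitive variant is simply the same one-liner without the tolower() calls:

awk 'NR == FNR { list[$0]=1; next } { if (! list[$0]) print }' exclude-these.txt from-this.txt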

Dennis Williamson
  • Why do you say files that aren't too huge? The fear here is (I assume) awk running the system out of system memory to create the hash, or is there some other limitation? – rogerdpack Oct 16 '15 at 16:58
  • for followers, there are even other more aggressive option to "sanitize" the lines (since the comparison has to be exact to use the associative array), ex http://unix.stackexchange.com/a/145132/8337 – rogerdpack Oct 16 '15 at 17:36
  • @rogerdpack: A large exclude file will require a large hash array (and a long processing time). A large "from-this.txt" will only require a long processing time. – Dennis Williamson Oct 16 '15 at 17:51
  • This fails (i.e. doesn't produce any output) if `exclude-these.txt` is empty. @jona-christopher-sahnwaldt 's answer below works in this case. You can also specify multiple files e.g. `awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 done.out failed.out f=2 all-files.out` – Graham Russell Jun 24 '17 at 09:55
  • @GrahamRussell I cannot find Jona Christopher Sahnwal's answer... – Digger Sep 24 '20 at 16:23
  • @Digger: [Here](https://stackoverflow.com/a/18477228/26428) – Dennis Williamson Jun 28 '22 at 14:03
13

Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):

awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt

Accessing r[$0] creates the entry for that line; there's no need to set a value.

Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.
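
A nice side effect of setting f explicitly is that several exclude files can be chained before the main file (Graham Russell shows this in the comments below); a sketch with hypothetical file names:

awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } }' f=1 exclude-a.txt exclude-b.txt f=2 from-this.txt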

  • How does this differ from the Dennis Williamson answer? Is the only difference that it doesn't do an assignment into the hash, so slightly faster than this? Algorithmic complexity is the same as his? – rogerdpack Oct 16 '15 at 16:48
  • The difference is mostly syntactic. I find the variable `f` clearer than `NR == FNR`, but that's a matter of taste. Assignment into the hash should be so fast that there's no measurable speed difference between the two versions. I think I was wrong about complexity - if lookup is constant, update should be constant as well (on average). I don't know why I thought update would be logarithmic. I'll edit my answer. – jcsahnwaldt Reinstate Monica Oct 17 '15 at 11:05
  • I tried a bunch of these answers, and this one was AMAZEBALLS fast. I had files with hundreds of thousands of lines. Worked like a charm! – Mr. T Apr 12 '17 at 04:53
  • This is my preferred solution. It works with multiple files and also empty exclude files e.g. `awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 empty.file done.out failed.out f=2 all-files.out`. Whereas the other `awk` solution fails with empty exclude file and can only take one. – Graham Russell Jun 30 '17 at 21:09
  • I found that the memory usage needed is roughly **3x** of the `exclude-these.txt` file. For example, my exclude file is 6 GB (in 90M lines), and `awk` seems to need 19 GB resident memory for the associative array. – nh2 Aug 08 '22 at 18:03
  • Regarding throughput: After the associative array is built, `awk` outputs lines at 50 MB/s on an AMD Ryzen 7 3700X backed by a Samsung MZQL2960HCJR NVMe SSD -- not great but not bad. – nh2 Aug 08 '22 at 18:09
5

If you have Ruby (1.9+):

#!/usr/bin/env ruby 
b=File.read("file2").split
open("file1").each do |x|
  x.chomp!
  puts x if !b.include?(x)
end

This has O(N²) complexity. If you care about performance, here's another version:

b=File.read("file2").split
a=File.read("file1").split
(a-b).each {|x| puts x}

which uses a hash to perform the subtraction, so the complexity is O(n + m), where n and m are the sizes of a and b.

Here's a little benchmark of the above with 100K lines, courtesy of user576875:

$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test

real    0m0.639s
user    0m0.554s
sys     0m0.021s

$ time sort file1 file2 | uniq -u > sort.test

real    0m2.311s
user    0m1.959s
sys     0m0.040s

$ diff <(sort -n ruby.test) <(sort -n sort.test)
$

diff was used to show there are no differences between the two generated files.

rogerdpack
kurumi
4

Some timing comparisons between various other answers:

$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null

real    0m0.019s
user    0m0.023s
sys     0m0.012s
$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null

real    0m0.026s
user    0m0.018s
sys     0m0.007s
$ time grep -xvf f2 f1 > /dev/null

real    0m43.197s
user    0m43.155s
sys     0m0.040s

sort f1 f2 | uniq -u isn't even a symmetric difference, because it removes lines that appear multiple times in either file.
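
A quick demonstration of that flaw: here the line a is unique to f1, yet it vanishes from the output merely because it occurs twice in f1.

$ printf 'a\na\nb\n' > f1
$ printf 'c\n' > f2
$ sort f1 f2 | uniq -u
b
c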

comm can also be used with stdin and here strings:

echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a
rogerdpack
Lri
3

Seems to be a job suitable for the SQLite shell:

create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify ".separator ××any_improbable_string××"
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
select * from file2 where line not in (select line from file1);
.q
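
Assuming the statements above are saved in a script (hypothetically diff.sql), the whole thing can be run non-interactively; with no database argument, sqlite3 works in a transient in-memory database and result.txt ends up in the current directory:

sqlite3 < diff.sql
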
Benoit
1

Did you try this with sed?

sed 's#^#sed -i '"'"'s%#g' f2 > f2.sh

sed -i 's#$#%%g'"'"' f1#g' f2.sh

sed -i '1i#!/bin/bash' f2.sh

sh f2.sh
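
To see what these four commands build: each line of f2 is wrapped into its own sed substitution, so for the f2 from the question the generated f2.sh should look roughly like the script below. Note that s%...%%g blanks out every occurrence of the text anywhere in f1 (including partial matches inside longer lines) rather than deleting whole lines:

#!/bin/bash
sed -i 's%line2%%g' f1
sed -i 's%line8%%g' f1
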
Ruan
1
$ cat values.txt
apple
banana
car
taxi

$ cat source.txt
fruits
mango
king
queen
number
23
43
sentence is long
so what
...
...

I made a small shell script to "weed" out of the source file the values that are present in the values.txt file.

$ cat weed_out.sh
from=$1
cp -p $from $from.final
for x in `cat values.txt`;
do
 grep -v $x $from.final > $from.final.tmp
 mv $from.final.tmp $from.final
done

executing...

$ ./weed_out source.txt

and you get a nicely cleaned-up file.
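
Note that the unquoted $x and the plain grep -v make this a substring match that can also break on spaces in values.txt. A slightly more robust sketch of the same loop, borrowing the -x and -F flags from the top answer:

from=$1
cp -p "$from" "$from.final"
while IFS= read -r x; do
  grep -v -x -F -- "$x" "$from.final" > "$from.final.tmp"
  mv "$from.final.tmp" "$from.final"
done < values.txt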

rajeev
0

Not a 'programming' answer but here's a quick and dirty solution: just go to http://www.listdiff.com/compare-2-lists-difference-tool.

Obviously won't work for huge files but it did the trick for me. A few notes:

  • I'm not affiliated with the website in any way (if you still don't believe me, then you can just search for a different tool online; I used the search term "set difference list online")
  • The linked website seems to make network calls on every list comparison, so don't feed it any sensitive data
youngrrrr
0

A Python way of filtering one list using another list.

Load files:

>>> f1 = open('f1').readlines()
>>> f2 = open('f2').readlines()

Remove '\n' string at the end of each line:

>>> f1 = [i.replace('\n', '') for i in f1]
>>> f2 = [i.replace('\n', '') for i in f2]

Print only the f1 lines that don't contain any line from f2 (note this is a substring match, not a whole-line match):

>>> [a for a in f1 if all(b not in a for b in f2)]
KSVelArc