
I have a large file A (containing email addresses), one address per line. I also have another file B that contains a different set of addresses.

Which command can I use to remove from file A all the addresses that appear in file B?

So, if file A contained:

A
B
C

and file B contained:

B
D
E

Then file A should be left with:

A
C

Now I know this question has probably been asked before, but the only command I found online gave me an error about a bad delimiter.

Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not a shell expert.

Ciro Santilli OurBigBook.com
slhck
  • possible duplicate of [Deleting lines from one file which are in another file](http://stackoverflow.com/questions/4780203/deleting-lines-from-one-file-which-are-in-another-file) – tripleee Oct 05 '14 at 17:46
  • Most of the answers here are for sorted files, and the most obvious one is missing, which of course isn't your fault, but that makes the other one more generally useful. – tripleee Oct 05 '14 at 18:10

12 Answers


If the files are sorted (they are in your example):

comm -23 file1 file2

The -2 and -3 flags suppress the lines unique to file2 and the lines common to both files, leaving only the lines unique to file1. If the files are not sorted, pipe them through sort first...

See the man page here
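A quick end-to-end check with the question's sample data (the file names are just examples); the process-substitution form handles unsorted input in Bash:

```shell
# Sample data from the question, already sorted.
printf 'A\nB\nC\n' > fileA
printf 'B\nD\nE\n' > fileB

# Lines of fileA that are not in fileB: prints A and C.
comm -23 fileA fileB

# If the inputs were unsorted, sort them on the fly (Bash process substitution).
comm -23 <(sort fileA) <(sort fileB)
```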

Chris Stryczynski
The Archetypal Paul
  • `comm -23 file1 file2 > file3` will output the contents of file1 that are not in file2 to file3. Then `mv file3 file1` replaces the original with the filtered result. –  Jul 17 '14 at 20:48
  • Alternatively, use `comm -23 file1 file2 | sponge file1`. No cleanup needed. – Socowi Mar 13 '18 at 22:08
  • Man page link is not loading for me – alternative: https://linux.die.net/man/1/comm – Felix Rabe Jun 23 '19 at 11:25
  • @Socowi What is sponge? I don't have that on my system. (macos 10.13) – Felix Rabe Jun 23 '19 at 11:29
  • @FelixRabe, well, that's tiresome. Replaced with your link. Thanks – The Archetypal Paul Jun 23 '19 at 16:15
  • @FelixRabe `sponge` is a program that fully consumes stdin before writing it to a file. On linux it is usually installed from a package called `moreutils`. – Socowi Jun 23 '19 at 20:46
  • Did not work for me at all :-( All duplicate lines are still present in the output. – Jeroen-bart Engelen Sep 13 '21 at 13:03
  • @Jeroen-bartEngelen, did you sort the files first? It certainly works (comm has been around for 40+ years...) – The Archetypal Paul Sep 13 '21 at 19:04
  • @TheArchetypalPaul I figured it out. It was line-endings. It's always line-endings in Linux :-) I edited and sorted both files on my Windows desktop, but for some reason the line-endings were saved differently. Dos2unix helped. – Jeroen-bart Engelen Sep 14 '21 at 22:45
  • The comm command compares files case-sensitively; to convert a file's contents to lowercase first, you can use: ```awk '{print tolower($0)}' < file1 ``` – MHZarei Apr 09 '23 at 11:45

grep -Fvxf <lines-to-remove> <all-lines>

Example:

cat <<EOF > A
b
1
a
0
01
b
1
EOF

cat <<EOF > B
0
1
EOF

grep -Fvxf B A

Output:

b
a
01
b

Explanation:

  • -F: use literal strings instead of the default BRE
  • -x: only consider matches that match the entire line
  • -v: print non-matching
  • -f file: take patterns from the given file

This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?

Here's a quick Bash function for in-place operation:

remove-lines() (
  remove_lines="$1"
  all_lines="$2"
  tmp_file="$(mktemp)"
  grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
  mv "$tmp_file" "$all_lines"
)

GitHub upstream.

usage:

remove-lines lines-to-remove remove-from-this-file

See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another

Ciro Santilli OurBigBook.com

awk to the rescue!

This solution doesn't require sorted inputs. You have to provide fileB first.

awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA

returns

A
C

How does it work?

The NR==FNR{a[$0];next} idiom stores the first file in an associative array, whose keys are used for a later "contains" test.

NR==FNR checks whether we're scanning the first file, where the global line counter (NR) equals the current file's line counter (FNR).

a[$0] adds the current line to the associative array as key, note that this behaves like a set, where there won't be any duplicate values (keys)

!($0 in a) we're now in the next file(s), in is a contains test, here it's checking whether current line is in the set we populated in the first step from the first file, ! negates the condition. What is missing here is the action, which by default is {print} and usually not written explicitly.

Note that this can now be used to remove blacklisted words.

$ awk '...' badwords allwords > goodwords

with a slight change it can clean multiple lists and create cleaned versions.

$ awk 'NR==FNR{a[$0];next} !($0 in a){print > FILENAME".clean"}' bad file1 file2 file3 ...
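For instance (with hypothetical file names), the multi-file variant writes a `.clean` copy next to each input; the parentheses around the redirection target avoid a parsing ambiguity in some awks:

```shell
# Build a blacklist and two files to clean.
printf 'bad\n' > bad
printf 'good\nbad\n' > file1
printf 'bad\nalso good\n' > file2

# Each line not in the blacklist goes to "<inputname>.clean".
awk 'NR==FNR{a[$0];next} !($0 in a){print > (FILENAME".clean")}' bad file1 file2

cat file1.clean   # good
cat file2.clean   # also good
```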
karakfa
  • full marks on this. To use this on the command line in GnuWin32 in Windows replace the single nibbles with double quotes. works a treat. many thanks. – twobob Feb 18 '16 at 01:25
  • This works but how will i be able to redirect the output to fileA in the form of A (With a new line) B – Anand Builders Feb 21 '17 at 15:40
  • I guess you mean `A\nC`, write to a temp file first and overwrite the original file `... > tmp && mv tmp fileA` – karakfa Feb 21 '17 at 15:58
  • Full marks in this from me too. This awk takes all of 1 second to process a file with 104,000 entries :+1: – MitchellK Jun 03 '19 at 10:34
  • When using this in scripts, make sure to first check that `fileB` is not empty (0 bytes long), because if it is, you will get an empty result instead of the expected contents of `fileA`. (Cause: `FNR==NR` will apply to `fileA` then.) – Peter Nowee Oct 20 '19 at 05:20

Another way to do the same thing (also requires sorted input):

join -v 1 fileA fileB

In Bash, if the files are not pre-sorted:

join -v 1 <(sort fileA) <(sort fileB)
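A quick check with the question's sample data:

```shell
# Sorted sample data from the question.
printf 'A\nB\nC\n' > fileA
printf 'B\nD\nE\n' > fileB

# -v 1 prints the lines of the first file with no match in the second.
join -v 1 fileA fileB    # prints A and C
```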
Dennis Williamson

You can do this even if your files are not sorted:

diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > tmp && mv tmp file-a

  • --new-line-format is for lines that are in file-b but not in file-a
  • --old-line-format is for lines that are in file-a but not in file-b
  • --unchanged-line-format is for lines that are in both

%L makes the line print exactly as-is.

man diff

for more details
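For example, with the question's sample data (note that these --line-format options are GNU diff extensions and are not available in BSD/macOS diff):

```shell
printf 'A\nB\nC\n' > fileA
printf 'B\nD\nE\n' > fileB

# diff exits non-zero when the files differ, hence the `|| true`.
# Prints A and C: the lines present only in fileA.
diff fileA fileB --new-line-format="" --old-line-format="%L" --unchanged-line-format="" || true
```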

Pop
  • You say this will work unless the files are sorted. What problems occur if they are sorted? What if they are partially sorted? – Carlos Macasaet Sep 24 '15 at 06:14
  • That was in response to the solution above that suggested using the `comm` command. `comm` requires the files to be sorted, so if they are sorted you can use that solution as well. You can use this solution regardless of whether the files are sorted or not, though. –  Apr 11 '16 at 09:15

This refinement of @karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file need be sorted, but speed is assured by virtue of awk's associative arrays. Only the lookup file is held in memory.

This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.

# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.

awk -v N=$N -v lookup="$LOOKUP" '
  BEGIN { while ( (getline < lookup) > 0 ) { dictionary[$0]=$0 } }
  !($N in dictionary) { print }'

(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
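A hypothetical run of this filter, dropping the rows of data.txt whose first column appears in ids.txt (N and both file names are just examples; the `> 0` guard on getline stops the loop on read errors as well as at end of file):

```shell
printf 'x 1\ny 2\nz 3\n' > data.txt
printf 'y\n' > ids.txt

N=1
LOOKUP=ids.txt
# Keep lines whose column $N is not a key in the lookup dictionary.
awk -v N=$N -v lookup="$LOOKUP" '
  BEGIN { while ( (getline < lookup) > 0 ) { dictionary[$0]=$0 } }
  !($N in dictionary) { print }' data.txt
# prints: x 1
#         z 3
```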

peak
  • This is tougher to use in a corner-case cross platform scenario than the other one liner. However hats off for the performance effort – twobob Feb 18 '16 at 01:27

You can use Python:

python -c '
lines_to_remove = set()
with open("file B", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())

with open("file A", "r") as f:
    for line in [line.strip() for line in f.readlines()]:
        if line not in lines_to_remove:
            print(line)
'
HelloGoodbye

You can use:

diff fileA fileB | grep "^<" | cut -c3- > tmp && mv tmp fileA

This works for files that are not sorted as well. (In diff's output, lines prefixed with `<` are those present only in fileA. Redirecting straight to fileA would truncate it before diff reads it, hence the temporary file.)

Darpan

To add to the Python answer above, here is a faster solution:

python -c '
lines_to_remove = None
with open("partial file") as f:
    lines_to_remove = {line.rstrip() for line in f.readlines()}

remaining_lines = None
with open("full file") as f:
    remaining_lines = {line.rstrip() for line in f.readlines()} - lines_to_remove

with open("output file", "w") as f:
    for line in remaining_lines:
        f.write(line + "\n")
'

This leverages the power of set subtraction. Note that sets discard duplicate lines and do not preserve the original order.

Rafael

To get one file minus the lines that appear in another file:

comm -23 <(sort bigFile.txt) <(sort smallfile.txt) > diff.txt

  • It's good practice on StackOverflow to add an explanation as to why your solution should work. – 4b0 May 11 '21 at 01:44
  • This doesn't really add anything over the accepted answer, except perhaps the tangential tip on how to use a process substitution to sort files which aren't already sorted. – tripleee May 11 '21 at 06:49

Here is a one-liner that pipes the output of a website through grep to strip the navigation elements, using lynx! You can replace lynx with `cat fileA` and unwanted-elements.txt with fileB.

lynx -dump -accept_all_cookies -nolist -width 1000 https://stackoverflow.com/ | grep -Fxvf unwanted-elements.txt
Omar Khan

To remove common lines between two files you can use grep, comm, or join.

grep is practical only for small pattern files, since every line of file2 is matched against every line of file1. Use -v along with -f.

grep -vf file2 file1 

This displays lines from file1 that do not match any line in file2.
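One caveat worth noting: without further flags, grep treats each line of file2 as a regular expression and also matches substrings. Adding -F and -x restricts it to exact, whole-line, fixed-string matches:

```shell
printf 'A\nAB\nC\n' > file1
printf 'A\n' > file2

grep -vf file2 file1     # prints only C: the pattern "A" also matches "AB"
grep -Fxvf file2 file1   # prints AB and C: exact whole-line matches only
```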

comm is a utility command that works on lexically sorted files. It takes two files as input and produces three text columns as output: lines only in the first file; lines only in the second file; and lines in both files. You can suppress printing of any column by using -1, -2 or -3 option accordingly.

comm -1 -3 file2 file1

This displays lines from file1 that do not match any line in file2.

Finally, there is join, a utility that performs an equality join on the specified (sorted) files. Its -v option prints the lines with no match in the other file, which removes the common lines:

join -v1 -v2 file1 file2
Aakarsh Gupta