Use awk sed command and while loop to remove entries from second file

Question

I have two output files:

FILE-A contains 70,000+ unique entries.
FILE-B contains a unique listing that I need to remove from FILE-A.

FILE-A:

 TOM
 JACK
 AILEY
 BORG
 ROSE
 ELI

FILE-B Content:

 TOM
 ELI

I want to remove anything listed in FILE-B from FILE-A.

FILE-C (Result file):

 JACK
 AILEY
 BORG
 ROSE

I assume I need a while read or for i statement. Can someone help me with this? I need to read FILE-A and, for every line in FILE-B, remove that line from FILE-A.

What command should I use?

You want the set of unique lines between both files? See http://backreference.org/2010/02/10/idiomatic-awk/ (search for "suppress duplicated lines"). It is a one-line awk solution. Also [this question](http://stackoverflow.com/q/2604088/258523) — Etan Reisner, Jul 17 '15 at 20:02
The best site for this Q would be http://unix.stackexchange.com/. Shell programming stuff easily gets lost in the clutter on SO. — Peter Cordes, Jul 18 '15 at 03:11

score 6 · Answer 1 · answered Jul 17 '15 at 20:04

You don't need awk, sed, or a loop. You just need grep:

fgrep -vxf FILE-B FILE-A

Please note the use of -x to match entries exactly.

Output:

JACK
AILEY
BORG
ROSE
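
To illustrate why `-x` matters, here is a minimal sketch with hypothetical sample data (the `TOMMY` entry is not in the question and is added only for illustration). Without `-x`, the pattern `TOM` also matches inside `TOMMY`, so that line is dropped as well. The sketch uses `grep -F`, which treats patterns as fixed strings just like `fgrep`.

```shell
# Hypothetical sample data; TOMMY is added only to show the substring problem.
dir=$(mktemp -d)
printf '%s\n' TOM TOMMY JACK ELI > "$dir/FILE-A"
printf '%s\n' TOM ELI > "$dir/FILE-B"

# Without -x: TOM matches inside TOMMY, so TOMMY is removed as well.
without_x=$(grep -Fvf "$dir/FILE-B" "$dir/FILE-A")

# With -x: only lines that equal a pattern in full are removed.
with_x=$(grep -Fvxf "$dir/FILE-B" "$dir/FILE-A")

printf 'without -x:\n%s\n\nwith -x:\n%s\n' "$without_x" "$with_x"
rm -r "$dir"
```

With this sample data, the run without `-x` keeps only `JACK`, while the run with `-x` keeps `TOMMY` and `JACK`.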

lcd047

You might also want `-F` to interpret the lines in FILE-B as fixed strings, rather than regexes. – Peter Cordes Jul 18 '15 at 03:09
@PeterCordes `fgrep` always interprets the patterns as plain strings rather than regexps. It's exactly the same as `grep -F`. – lcd047 Jul 18 '15 at 04:53
Invocation as `fgrep` is deprecated. see `man grep` for details. – Jahid Jul 18 '15 at 04:58
Really? Well, none of my *BSD machines seem to have received that memo. :) Seriously, `fgrep` has been in use for 40+ years, do you really think it's going to go away just because somebody at GNU wishes it to do that? – lcd047 Jul 18 '15 at 05:08
Mine is `GNU grep`. It says: `Direct invocation as either egrep or fgrep is deprecated, but is provided to allow historical applications that rely on them to run unmodified`. It's better to use `grep -F` instead for better support, not all of the systems have BSD `grep`.... – Jahid Jul 18 '15 at 05:36
@Jahid Your argument works the other way around too: not all systems have (or care about) GNU `grep`. However, all systems _do_ have `fgrep` right now, they always did, and like I said, they are probably still going to keep it for a long while. So please, keep preaching about `grep -F` to the GNU circles if you must, but please refrain from chastising people on general UNIX forums for using `fgrep`. They aren't doing anything wrong. – lcd047 Jul 18 '15 at 05:45
@lcd047: I somehow failed to notice that you *did* use `fgrep` in the first place. GNU is probably never going to remove the alternate names for invoking `grep`, so I agree with recommending `fgrep`, even though `-F` is specified by POSIX. – Peter Cordes Jul 18 '15 at 06:05
It **doesn't** work the other way around. `grep -F` is available everywhere and nowhere it is deprecated. all systems have `grep -F` and will for so long, but `fgrep` , no.. `grep -F` is obviously the safest choice. – Jahid Jul 18 '15 at 06:07
@Jahid Ah, the smell of a good pissing contest on a Saturday morning. :) `grep -F` is a POSIX invention, while `fgrep` and `egrep` were introduced by Aho for SysV, long before POSIX came to be. Which means `fgrep` works on more systems than `grep -F`, and obviously that makes it the safer choice. Nyah, nyah. :) – lcd047 Jul 18 '15 at 06:28
How about you post your claim as an answer to [this question](http://stackoverflow.com/q/31490686/3744681), though it seems it will be closed pretty soon because it is a bit opinion based. – Jahid Jul 18 '15 at 11:46
@lcd047 - whether or not it is a POSIX invention is irrelevant. It is the standardized way of doing a `grep` for `-F`ixed strings. If you're working on a system which claims POSIX conformance the right way to do it is w/ `grep -F`. If you are not working on such a system, then if you *do* have a `grep` then the best you can do is guess how you should call it at all. In my opinion, one situation should pretty obviously be preferred to the other. – mikeserv Jul 18 '15 at 15:27
@mikeserv That's fine: I'm not using a system that [claims POSIX conformance](https://en.wikipedia.org/wiki/POSIX#POSIX-certified). I never claimed or implied that I did. Actually, I just wrote an answer, and my only claim was that my answer is adequate to the OP's needs. Can you guys show a single situation when it isn't? If you can, I'll happily delete my answer. If you can't, then you aren't commenting on my answer, but pushing an agenda, and maybe you should consider doing that in a more appropriate place? _shrug_ – lcd047 Jul 18 '15 at 16:52
@mikeserv On a side note: `-w`, `-o`, `-a`, and many others are not [POSIX](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html) options for `grep` either. Do you fight with equal fervor to expunge them from SO / SE answers? Do you avoid them in your own answers? – lcd047 Jul 18 '15 at 17:06
@mikeserv Except [when you aren't](http://unix.stackexchange.com/a/122757/111878), right? Sorry, that was petty. But not more petty than all the exchange above. – lcd047 Jul 18 '15 at 17:37
@lcd047 - respectively: No, yes. And there's no argument about the content of your answer made here, only about which form should be preferred. And by the way, *correction*: some BSDs do claim POSIX-conformance, but that doesn't make them certified. [FREEBSD](http://people.freebsd.org/~schweikh/posix-utilities.html) – mikeserv Jul 18 '15 at 17:37
What? I do take care to avoid -[woa]. Compare some 1050 of my answers and i think you'll find non-posix options to any utility in a very significant minority. I also avoid `bash`isms as much as possible *(especially because `bash` is godawful slow)*. You did have to go back more than a year for that one. And, in honesty, i have learned much since then. I only really even started learning the CLI about two years ago. And i don't understand what's petty...? – mikeserv Jul 18 '15 at 17:40
...sorry, i forgot to up vote it. It's a technique I use often. Again, i don't argue against the suitability of your answer to the question - I'm sure they're well-aligned w/ one another. I'm only saying the standardized form is the way to go as it leaves less to the imagination - which is always a plus in my book. And so my argument is with your own here in the comments, and is not directly related to the answer at all. – mikeserv Jul 18 '15 at 17:49

anubhava · Answer 2 · 2015-07-18T05:19:39.153

4

You can use grep -v -f:

grep -xFvf FILE-B FILE-A
JACK
AILEY
BORG
ROSE

edited Jul 18 '15 at 05:19

answered Jul 17 '15 at 20:02

karakfa · Answer 3 · 2015-07-17T20:35:23.573

1

If you start with sorted input, the tool for this task is comm:

comm -23 FILE-A FILE-B

The options mean:

-2              suppress lines unique to FILE-B
-3              suppress lines that appear in both files

If the input is not already sorted, you can sort it on the fly:

comm -23 <(sort FILE-A) <(sort FILE-B)
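
One consequence worth seeing in practice: because comm works on sorted input, the result comes out in sorted order, not in FILE-A's original order. The following is a minimal sketch using the question's sample data. The temporary directory and the `A.sorted` / `B.sorted` file names are hypothetical, and temporary files are used in place of `<( )` only so the sketch also runs in shells without process substitution.

```shell
# Sample data from the question, written to a scratch directory.
dir=$(mktemp -d)
printf '%s\n' TOM JACK AILEY BORG ROSE ELI > "$dir/FILE-A"
printf '%s\n' TOM ELI > "$dir/FILE-B"

# Sort into temporary files (hypothetical names) instead of using <( ).
sort "$dir/FILE-A" > "$dir/A.sorted"
sort "$dir/FILE-B" > "$dir/B.sorted"

# Keep only the lines unique to the first file.
result=$(comm -23 "$dir/A.sorted" "$dir/B.sorted")
printf '%s\n' "$result"
rm -r "$dir"
```

This prints AILEY, BORG, JACK, ROSE in that order. If FILE-A's original order must be preserved, the grep and awk approaches are a better fit.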

edited Jul 17 '15 at 20:35

answered Jul 17 '15 at 20:29

Jahid · Answer 4 · 2015-07-22T03:45:53.473

1

You don't need any loop; a single awk or sed command is enough:

awk:

awk 'FNR==NR {a[$0];next} !($0 in a)' FILE-B FILE-A >FILE-C

sed:

sed "s=^=/^=;s=$=$/d=" FILE-B | sed -f- FILE-A >FILE-C

Note:

While the sed version works for the data shown, it won't handle any text in FILE-B which can be interpreted as a regex pattern.
The awk solution reads FILE-B entirely into memory. It doesn't have the limitation of interpreting text as regex like the sed solution.
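
The regex limitation can be reproduced with a small sketch. The sample data here is hypothetical (an `A.C` entry is not in the question), and the generated sed script is written to a temporary file, `script.sed` (also a hypothetical name), so the sketch does not rely on a sed that reads scripts from stdin.

```shell
# Hypothetical data: FILE-B contains "A.C", where "." is a regex wildcard.
dir=$(mktemp -d)
printf '%s\n' ABC A.C JACK > "$dir/FILE-A"
printf '%s\n' A.C > "$dir/FILE-B"

# sed approach: each FILE-B line becomes a /^...$/d command, so "A.C"
# is treated as a regex and also deletes "ABC".
sed 's=^=/^=;s=$=$/d=' "$dir/FILE-B" > "$dir/script.sed"
sed_out=$(sed -f "$dir/script.sed" "$dir/FILE-A")

# awk approach: lines are compared as literal array keys, so only the
# exact line "A.C" is removed.
awk_out=$(awk 'FNR==NR {a[$0];next} !($0 in a)' "$dir/FILE-B" "$dir/FILE-A")

printf 'sed:\n%s\n\nawk:\n%s\n' "$sed_out" "$awk_out"
rm -r "$dir"
```

With this data the sed version keeps only JACK, while the awk version keeps ABC and JACK.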

edited Jul 22 '15 at 03:45

answered Jul 18 '15 at 04:35

The `awk` solution reads `FILE-B` entirely in memory. The `sed` solution assumes a `sed` that can read scripts from `stdin` (i.e. only GNU `sed`, AFAIK). – Sato Katsura Jul 19 '15 at 11:01
Just a note: Though the `sed` version works for the data shown, it won't handle any text in FILE-B which can be interpreted as a regex pattern - the `awk` version can handle such an issue. – Peter.O Jul 21 '15 at 18:55
@Peter.O I was wondering who is gonna notice that first. That is the one valid and serious pitfall of the `sed` solution. It can be overcome by sanitizing FILE-B though. Anyway, I added your note. thnks – Jahid Jul 22 '15 at 03:50
@SatoKatsura I added your note too. thnks – Jahid Jul 22 '15 at 03:53
