How can I find lines in one file but not the other using bash scripting?

Question

Imagine file 1:

#include "first.h"
#include "second.h"
#include "third.h"

// more code here
...

Imagine file 2:

#include "fifth.h"
#include "second.h"
#include "eigth.h"

// more code here
...

I want to get the headers that are included in file 2 but not in file 1, and only those lines. So, when run, a diff of file 1 and file 2 would produce:

#include "fifth.h"
#include "eigth.h"

I know how to do it in Perl/Python/Ruby, but I'd like to accomplish this without using a different programming language.

For more ways to do the same thing take a look at this [BashFAQ](http://mywiki.wooledge.org/BashFAQ/036). Keep in mind since all of these solutions do line-based pattern matching, you'll have to make sure you format your include lines the same way everywhere. Examples: `#include` will not match `# include` and `"first.h"` will not match `"../first.h"` from a sub-directory, etc. — jw013, Aug 04 '11 at 08:08
possible duplicate of [Remove Lines from File which appear in another File](http://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file) — Ciro Santilli OurBigBook.com, Jun 27 '15 at 08:49

score 25 · Answer 1 · answered Aug 03 '11 at 20:39

This is a one-liner, but does not preserve the order:

comm -13 <(grep '#include' file1 | sort) <(grep '#include' file2 | sort)

If you need to preserve the order:

awk '
  !/#include/ {next} 
  FILENAME == ARGV[1] {include[$2]=1; next} 
  !($2 in include)
' file1 file2
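To see how the two variants differ, here is a minimal sketch that rebuilds the question's sample files in a scratch directory (the `workdir` variable and the result variable names are only for illustration) and runs both commands. It assumes bash, since `<(...)` process substitution is not available in plain `sh`.

```shell
#!/usr/bin/env bash
# Recreate the two sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '#include "first.h"' '#include "second.h"' '#include "third.h"' \
  '' '// more code here' > "$workdir/file1"
printf '%s\n' '#include "fifth.h"' '#include "second.h"' '#include "eigth.h"' \
  '' '// more code here' > "$workdir/file2"

# Sorted variant: comm -13 hides column 1 (lines only in file1) and
# column 3 (lines in both), leaving the lines found only in file2.
sorted_result=$(comm -13 <(grep '#include' "$workdir/file1" | sort) \
                         <(grep '#include' "$workdir/file2" | sort))

# Order-preserving variant: remember file1's header names, then print the
# #include lines of file2 whose header name was never seen.
ordered_result=$(awk '
  !/#include/ {next}
  FILENAME == ARGV[1] {include[$2]=1; next}
  !($2 in include)
' "$workdir/file1" "$workdir/file2")

printf 'comm:\n%s\n\nawk:\n%s\n' "$sorted_result" "$ordered_result"
rm -rf "$workdir"
```

The `comm` variant lists `eigth.h` before `fifth.h` because its input is sorted, while the `awk` variant keeps the order in which the lines appear in file2.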

answered Aug 03 '11 at 20:39 by glenn jackman

More generalized answer here: http://stackoverflow.com/a/5812853/973402; this solution is WAY faster than grep -f when you have a lot of patterns to check against – Joshua Richardson Jan 07 '14 at 21:55

score 9 · Accepted Answer · edited Feb 21 '15 at 14:08

If it's ok to use a temp file, try this:

grep include file1.h > /tmp/x && grep -f /tmp/x -v file2.h | grep include

This:

1. extracts all includes from file1.h and writes them to the file /tmp/x
2. uses this file to get all lines from file2.h that are not contained in this list
3. extracts all includes from the remainder of file2.h

It probably does not handle differences in whitespace correctly, though (for example, `# include` versus `#include`).

EDIT: to prevent false positives, use a different pattern for the last grep (thanks to jw013 for mentioning this):

grep include file1.h > /tmp/x && grep -f /tmp/x -v file2.h | grep "^#include"
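A fixed name such as /tmp/x can collide with other scripts or users. One possible variation, sketched here rather than taken from the answer, is to let `mktemp` pick the name and clean up with a `trap`. The `workdir`, `patterns`, and `tempfile_result` names are made up for this example, and the sample files mirror the ones in the question.

```shell
#!/usr/bin/env bash
# Sample input mirroring the question, placed in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '#include "first.h"' '#include "second.h"' '#include "third.h"' \
  '' '// more code here' > "$workdir/file1.h"
printf '%s\n' '#include "fifth.h"' '#include "second.h"' '#include "eigth.h"' \
  '' '// more code here' > "$workdir/file2.h"

# Let mktemp choose a unique pattern-file name instead of hard-coding /tmp/x,
# and remove everything again when the script exits.
patterns=$(mktemp)
trap 'rm -rf "$workdir" "$patterns"' EXIT

# Same three steps as the answer: collect file1's includes, drop matching
# lines from file2, then keep only the remaining #include lines.
grep include "$workdir/file1.h" > "$patterns" &&
  tempfile_result=$(grep -f "$patterns" -v "$workdir/file2.h" | grep '^#include')

printf '%s\n' "$tempfile_result"
```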

answered Aug 03 '11 at 20:10 by Frank Schmitt; edited Feb 21 '15 at 14:08 by rubo77

Maybe change that last grep pattern to `'^#include'` unless you also want to see random lines of code where you happened to use the word "include" – jw013 Aug 04 '11 at 07:53
When grepping for matching lines, you should use the options `-F` for "fixed-string" (non-regexp) patterns and `-x` for "whole line" matches. Also, the temp file isn't strictly necessary; you can use `-f -` to take the pattern file from standard input. The resulting command becomes: `grep '^#include' file1.h | grep -f - -vFx file2.h | grep '^#include'` – Lee Oct 24 '13 at 00:51
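To illustrate why `-F` and `-x` matter, here is a small sketch with made-up header names (`net.util.h` and `net_util.h` are hypothetical and do not come from the question). Without `-F`, each line of file1.h is read as a regular expression, so the `.` in a header name matches any character.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: these header names are invented for illustration.
workdir=$(mktemp -d)
printf '%s\n' '#include "net.util.h"' '#include "second.h"' > "$workdir/file1.h"
printf '%s\n' '#include "net_util.h"' '#include "second.h"' '#include "fifth.h"' \
  > "$workdir/file2.h"

# Patterns treated as regular expressions: the "." in net.util.h also
# matches "_", so net_util.h is wrongly treated as present in file1.h.
regex_result=$(grep '^#include' "$workdir/file1.h" |
  grep -f - -v "$workdir/file2.h" | grep '^#include')

# Fixed strings (-F) matched against whole lines (-x): only lines that are
# byte-for-byte identical to a line of file1.h are dropped.
fixed_result=$(grep '^#include' "$workdir/file1.h" |
  grep -f - -vFx "$workdir/file2.h" | grep '^#include')

printf 'regex:\n%s\n\nfixed:\n%s\n' "$regex_result" "$fixed_result"
rm -rf "$workdir"
```

With the regular-expression patterns only `fifth.h` is reported; with `-Fx` both `net_util.h` and `fifth.h` are.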

score 8 · Answer 3 · answered Aug 04 '11 at 07:19

This variant requires an fgrep with the -f option. GNU grep (i.e. any Linux system, and then some) should work fine.

# Find occurrences of '#include' in file1.h
fgrep '#include' file1.h |
# Remove any identical lines from file2.h
fgrep -vxf - file2.h |
# Result is all lines not present in file1.h.  Out of those, extract #includes
fgrep '#include'

This does not require any sorting, nor any explicit temporary files. In theory, fgrep -f could use a temporary file behind the scenes, but I believe GNU fgrep doesn't.
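Recent GNU grep releases (3.8 and later) print an obsolescence warning for `fgrep` and suggest `grep -F` instead. A sketch of the same pipeline spelled with `grep -F` and wrapped in a function follows; the function name `includes_only_in_second` and the demo variables are hypothetical, chosen only for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the answer above.
# Usage: includes_only_in_second FILE1 FILE2
# Prints the #include lines of FILE2 that do not appear verbatim in FILE1.
includes_only_in_second() {
  grep -F '#include' "$1" |       # collect FILE1's #include lines
    grep -F -v -x -f - "$2" |     # drop identical whole lines from FILE2
    grep -F '#include'            # keep only the #include lines that remain
}

# Demo with the sample files from the question.
demo_dir=$(mktemp -d)
printf '%s\n' '#include "first.h"' '#include "second.h"' '#include "third.h"' \
  '' '// more code here' > "$demo_dir/file1.h"
printf '%s\n' '#include "fifth.h"' '#include "second.h"' '#include "eigth.h"' \
  '' '// more code here' > "$demo_dir/file2.h"

only_in_second=$(includes_only_in_second "$demo_dir/file1.h" "$demo_dir/file2.h")
printf '%s\n' "$only_in_second"
rm -rf "$demo_dir"
```

Because nothing is sorted, the output keeps the order of the lines in the second file.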

POSIX specifies `-f` so any POSIX compliant `grep` should have it. — jw013, Aug 04 '11 at 08:07

score 6 · Answer 4 · answered Oct 08 '14 at 21:51

If the goal need not be accomplished with Bash alone (i.e., use of external programs is acceptable), then use combine from moreutils:

combine file1 not file2 > lines_in_file1_not_in_file2

answered Oct 08 '14 at 21:51 by pmocek

score 2 · Answer 5 · answered Oct 21 '13 at 13:58

cat $file1 $file2 | grep '#include' | sort | uniq -u

answered Oct 21 '13 at 13:58 by plbogen

This will list `#include` lines unique to file1 or file2. I think that you want `cat $file1 $file1 $file2 | grep '#include' | sort | uniq -u`, with file1 repeated so that its `#include` lines are doubled and will then be filtered by the `uniq -u`. – esmit Dec 13 '13 at 00:19
And since `grep` can read multiple input files, you can use `grep -h` and do away with the (only moderately useless) `cat`. – tripleee Mar 15 '14 at 12:23
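Putting the two comments together, a sketch of the adjusted pipeline might look like the following. The sample files mirror the question, and the `either_result` and `only_file2_result` names are made up for the example. One caveat worth stating as an assumption: `uniq -u` also drops any line that occurs more than once inside file2 itself, so this only works when each file lists a header at most once.

```shell
#!/usr/bin/env bash
# Sample input mirroring the question.
workdir=$(mktemp -d)
file1=$workdir/file1
file2=$workdir/file2
printf '%s\n' '#include "first.h"' '#include "second.h"' '#include "third.h"' > "$file1"
printf '%s\n' '#include "fifth.h"' '#include "second.h"' '#include "eigth.h"' > "$file2"

# Each file read once: uniq -u keeps lines unique to EITHER file.
either_result=$(grep -h '#include' "$file1" "$file2" | sort | uniq -u)

# file1 read twice: each of its lines now occurs at least twice, so
# uniq -u discards them and only lines unique to file2 survive.
only_file2_result=$(grep -h '#include' "$file1" "$file1" "$file2" | sort | uniq -u)

printf 'either:\n%s\n\nonly file2:\n%s\n' "$either_result" "$only_file2_result"
rm -rf "$workdir"
```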
