219

I have a series of text files for which I'd like to know the lines in common rather than the lines which are different between them. Command line Unix or Windows is fine.

File foo:

linux-vdso.so.1 =>  (0x00007fffccffe000)
libvlc.so.2 => /usr/lib/libvlc.so.2 (0x00007f0dc4b0b000)
libvlccore.so.0 => /usr/lib/libvlccore.so.0 (0x00007f0dc483f000)
libc.so.6 => /lib/libc.so.6 (0x00007f0dc44cd000)

File bar:

libkdeui.so.5 => /usr/lib/libkdeui.so.5 (0x00007f716ae22000)
libkio.so.5 => /usr/lib/libkio.so.5 (0x00007f716a96d000)
linux-vdso.so.1 =>  (0x00007fffccffe000)

So, given these two files above, the output of the desired utility would be akin to file1:line_number, file2:line_number == matching text (just a suggestion; I really don't care what the syntax is):

foo:1, bar:3 == linux-vdso.so.1 =>  (0x00007fffccffe000)
Peter Mortensen
matt wilkie
  • @ChristopherSchultz My mistake. 1st line in 1st example supposed match last line in 2nd example. Thanks for catching the mistake; changing. – matt wilkie Jul 22 '15 at 17:25
  • 2
    Another similar question with good answers: http://unix.stackexchange.com/questions/1079/output-the-common-lines-similarities-of-two-text-files-the-opposite-of-diff – MortezaE Sep 25 '15 at 08:58
  • More general solution: We should submit a patch to **GNU diffutils**, to add an option for this, as it is really just a trivial negation in the equality test. – Evi1M4chine Feb 15 '23 at 11:06
  • In case anyone was interested in writing such a patch: I just took a lengthy look at `diff`’s source code, and it’s not trivial at all, because it’s surprisingly large and messy. There is also no bug tracker, but merely a mailing list. So the best I can recommend, is to request this via mail. (I’d advise a clean rewrite from scratch though. My eyes still hurt. ;) – Evi1M4chine Feb 15 '23 at 11:53

8 Answers

253

On *nix, you can use comm. The answer to the question is:

comm -1 -2 file1.sorted file2.sorted 
# where file1.sorted and file2.sorted are sorted copies of file1 and file2

Here's the full usage of comm:

comm [-1] [-2] [-3] file1 file2
-1 Suppress the output column of lines unique to file1.
-2 Suppress the output column of lines unique to file2.
-3 Suppress the output column of lines duplicated in file1 and file2. 

Also note that it is important to sort the files before using comm, as mentioned in the man pages.
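
As a minimal sketch of the sort-then-comm workflow described above, the script below builds two small sample files, sorts copies of them, and prints the shared lines. The temporary directory, the sample contents, and the `common` variable are assumptions made for illustration. LC_ALL=C is set only so that sort and comm agree on the collation order.

```shell
#!/bin/sh
# Illustrative sketch: all paths and sample lines here are made up.
tmpdir=$(mktemp -d)
printf '%s\n' 'libvlc.so.2' 'linux-vdso.so.1' 'libc.so.6' > "$tmpdir/foo"
printf '%s\n' 'libkio.so.5' 'linux-vdso.so.1' > "$tmpdir/bar"

# comm expects both inputs to be sorted in the same collation order.
LC_ALL=C sort "$tmpdir/foo" > "$tmpdir/foo.sorted"
LC_ALL=C sort "$tmpdir/bar" > "$tmpdir/bar.sorted"

# -1 and -2 hide the "only in file1" and "only in file2" columns,
# leaving just the third column: lines present in both files.
common=$(LC_ALL=C comm -12 "$tmpdir/foo.sorted" "$tmpdir/bar.sorted")
printf '%s\n' "$common"

rm -rf "$tmpdir"
```

The originals are left untouched; only the sorted copies are compared.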

mooreds
Dan Lew
  • 3
    comm [-1] [-2] [-3 ] file1 file2 -1 Suppress the output column of lines unique to file1. -2 Suppress the output column of lines unique to file2. -3 Suppress the output column of lines duplicated in file1 and file2. – ojblass Apr 14 '09 at 05:43
  • @ojblass: Added this to the answer. – Matt J Apr 14 '09 at 07:15
  • 10
    I discovered it is important the files be sorted before using comm. Perhaps add that to the answer. – matt wilkie Apr 21 '09 at 16:14
  • 12
    short answer to the question: comm -1 -2 file1 file2 – greggles Nov 02 '12 at 00:16
  • 10
    You can use this if your files aren't sorted: comm -1 -2 <(sort filename1) <(sort filename2) – Kevin Wheeler Dec 10 '15 at 20:30
  • In the "there's more than one way to skin a cat" department, `diff --unchanged-line-format='%L' --old-line-format='' --new-line-format=''` should produce identical output if, for some reason, comm is not available. – user3396385 Oct 11 '16 at 15:23
  • For ppl who are wondering: You can create `file1.sorted` via executing `sort file1 > file1.sorted` – los_floppos Feb 22 '22 at 13:31
  • 1
    And `sort -u file1 > file1.sorted` (--unique) the output will not have any repeated lines. – Max Power Dec 23 '22 at 03:10
80

I found this answer on a question listed as a duplicate. I find grep to be more administrator-friendly than comm, so if you just want the set of matching lines (useful for comparing CSV files, for instance) simply use

grep -F -x -f file1 file2

Or the shorter fgrep form (fgrep is a deprecated alias for grep -F, and recent GNU grep releases print a warning when it is used):

fgrep -xf file1 file2

Plus, you can use file2* to glob and look for lines in common with multiple files, rather than just two.

Some other handy variations include

  • -n flag to show the line number of each matched line
  • -c to only count the number of lines that match
  • -v to display only the lines in file2 that do not appear in file1 (or use diff).

Using comm is faster, but that speed comes at the expense of having to sort your files first. Because sorting discards the original line order, comm isn't very useful as a 'reverse diff'.
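
The variations listed above can be tried on throwaway data. This is only a sketch: the temporary directory, the sample lines, and the variable names are assumptions for illustration, not part of the original answer.

```shell
#!/bin/sh
# Illustrative sketch: file contents and paths are made up.
tmpdir=$(mktemp -d)
printf '%s\n' 'alpha' 'beta' 'gamma' > "$tmpdir/file1"
printf '%s\n' 'gamma' 'delta' 'alpha' > "$tmpdir/file2"

# -F: treat patterns as fixed strings, -x: match whole lines only,
# -f: read the patterns (one per line) from file1.
matches=$(grep -Fxf "$tmpdir/file1" "$tmpdir/file2")

# -n prefixes each match with its line number in file2.
numbered=$(grep -n -Fxf "$tmpdir/file1" "$tmpdir/file2")

# -c prints only the number of matching lines.
count=$(grep -c -Fxf "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$matches" "$numbered" "$count"
rm -rf "$tmpdir"
```

Matches are reported in the order they occur in file2, and neither file needs to be sorted.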

Peter Mortensen
Ryder
  • 1
    Thanks Ryder, this could be more useful than comm to many. You should link to the source answer (there are over half a dozen linked in Q in right-hand nav; it's a bit of work to find). It would also be nice to know how well grep does with un- or differently sorted input, and whether it can print the respective line numbers of matches. – matt wilkie Jan 15 '15 at 17:12
  • 3
    @mattwilkie I felt the need to come back and clarify the use of the `-v` flag after I slipped up with it myself. Say you have two csv files file1 and file2, and they have both overlapping and non-overlapping rows. If you want all and only the non-overlapping rows, using `fgrep -v -f file1 file2` will only return the non-overlapping rows in file2, *and none of the additional non-overlapping rows in file1*. This may be obvious to some, but better to state the obvious than risk misinterpretation. In this particular case, sorting the files and using `comm` is still the better choice. – Ryder May 12 '15 at 08:44
  • 2
    Thank you for coming back and clarifying Ryder. The extra attention is noted and appreciated (all too easy to let old things slip away!). I've switched the accepted answer because comm is clearly the community's choice, even though personally I still use this when sorting is unwanted overhead. – matt wilkie May 12 '15 at 18:18
  • 2
    Another complication when using `grep`: any blank line in the first file will match every line in the second file. Make sure `file1` has no blank lines, or it will look like the files are identical. – Christopher Schultz Jul 22 '15 at 14:11
  • `grep -Fxf` it is for me. – loxaxs Mar 17 '18 at 12:03
  • I find this better than `comm` because it is able to catch more similar lines between two different source codes. The idea is that, I want to determine if two source files are related sometime in their past versions. – daparic Sep 05 '18 at 18:30
  • Can we apply this for more than 2 files? – alper Dec 01 '22 at 20:11
  • @alper technically yes; but then the -xf flag will match with any line in all the files following file1, and print out all the lines that match in every file listed after file1. Unless you specify the `-h` flag, the name of the file will be prefixed to every printed line. – Ryder Dec 02 '22 at 22:36
36

It was asked here before: Unix command to find lines common in two files

You could also try with Perl (credit goes here):

perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/'  file1 file2
Peter Mortensen
ChristopheD
24

I just learned the comm command from the answers, but I wanted to add something extra: if the files are not sorted, and you don't want to touch the original files, you can pipe the output of the sort command. This leaves the original files intact. It works in Bash, but I can't say about other shells.

comm -1 -2 <(sort file1) <(sort file2)

This can be extended to compare command output, instead of files:

comm -1 -2 <(ls /dir1 | sort) <(ls /dir2 | sort)
Peter Mortensen
Greg Mueller
  • The problem with this is, that you might not want the result to be sorted. Like a program code file. Really, `diff` should have an option for this, just like `patch` has the `-r` option to reverse things. – Evi1M4chine Feb 15 '23 at 11:04
13

The easiest way to do it is:

awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2

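The files do not need to be sorted. Note that $1 compares only the first whitespace-separated field of each line; use $0 instead to compare whole lines.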
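
The one-liner above keys its array on `$1`, the first whitespace-separated field. The sketch below contrasts it with a whole-line variant keyed on `$0`; the sample files and the `by_field` and `by_line` variable names are assumptions made for illustration.

```shell
#!/bin/sh
# Illustrative sketch: file contents and paths are made up.
tmpdir=$(mktemp -d)
printf '%s\n' 'lib a' 'lib b' 'lib c' > "$tmpdir/file1"
printf '%s\n' 'lib c' 'lib z' 'lib a' > "$tmpdir/file2"

# Original form: compares only the first field ("lib" on every line here),
# so every line of file2 is reported.
by_field=$(awk 'NR==FNR{a[$1]++;next} a[$1]' "$tmpdir/file1" "$tmpdir/file2")

# Whole-line variant: NR==FNR holds only while the first file is being read,
# so file1's lines are stored and file2's lines are then tested against them.
by_line=$(awk 'NR==FNR { seen[$0] = 1; next } $0 in seen' \
    "$tmpdir/file1" "$tmpdir/file2")

printf 'by field:\n%s\nby line:\n%s\n' "$by_field" "$by_line"
rm -rf "$tmpdir"
```

Both forms print matches in file2's order, which is what makes this approach usable on files whose line order matters.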

Peter Mortensen
Gopu
  • 3
    This is unlike most of the answers here in that it allows you to reconstruct source templates. I have two files built from the same wrapper, with different text inserted at a few points. This answer enabled me to recover the wrapper. – Lucas Gonze Aug 03 '17 at 21:54
  • 1
    Explanation can be found in this question https://stackoverflow.com/q/32481877 or in Idiomatic AWK blog referenced from one of its comments. – Tomáš Záluský Apr 08 '22 at 07:13
6

I think the diff utility itself, with its unified (-U) option, can be used to achieve this effect. Because the first column of diff's unified output marks whether a line is an addition (+) or a deletion (-), we can keep the lines that haven't changed, which start with a space.

diff -U1000 file_1 file_2 | grep '^ '

The number 1000 is arbitrary; it only needs to be at least as large as the number of lines in the longer file, so that every unchanged line appears as context.

Here's the full, foolproof set of commands:

f1="file_1"
f2="file_2"

# Redirect into wc so it prints only the count (no file name to strip,
# and no dependence on how wc pads its output).
lc1=$(wc -l < "$f1")
lc2=$(wc -l < "$f2")
lcmax=$(( lc1 > lc2 ? lc1 : lc2 ))

diff -U$lcmax "$f1" "$f2" | grep '^ ' | less

# Alternatively, use this grep to ignore the lines starting
# with +, -, and @ signs.
#   grep -vE '^[+@-]'

If you want to include the lines that are just moved around, you can sort the input before diffing, like so:

f1="file_1"
f2="file_2"

# Redirect into wc so it prints only the count.
lc1=$(wc -l < "$f1")
lc2=$(wc -l < "$f2")
lcmax=$(( lc1 > lc2 ? lc1 : lc2 ))

diff -U$lcmax <(sort "$f1") <(sort "$f2") | grep '^ ' | less
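
A self-contained sketch of the unsorted variant, run on made-up sample files. The temporary directory, the file contents, and the `unchanged` variable are assumptions for illustration; the trailing `cut -c2-` is an addition that strips the leading space diff puts in front of unchanged lines.

```shell
#!/bin/sh
# Illustrative sketch: file contents and paths are made up.
tmpdir=$(mktemp -d)
printf '%s\n' 'header' 'one' 'footer' > "$tmpdir/file_1"
printf '%s\n' 'header' 'two' 'three' 'footer' > "$tmpdir/file_2"

# Use the longer file's line count as the context size, so every
# unchanged line is included in the unified diff.
lc1=$(wc -l < "$tmpdir/file_1")
lc2=$(wc -l < "$tmpdir/file_2")
lcmax=$(( lc1 > lc2 ? lc1 : lc2 ))

# Keep only context lines (leading space), then drop that marker column.
unchanged=$(diff -U "$lcmax" "$tmpdir/file_1" "$tmpdir/file_2" \
    | grep '^ ' | cut -c2-)

printf '%s\n' "$unchanged"
rm -rf "$tmpdir"
```

Unlike the comm approach, this keeps the common lines in their original order, but a line that merely moved between the two files is not reported as common.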
Gurjeet Singh
1

Just for information, I made a little tool for Windows that does the same thing as "grep -F -x -f file1 file2", since I haven't found anything equivalent to this command on Windows.

Here it is: http://www.nerdzcore.com/?page=commonlines

Usage is "CommonLines inputFile1 inputFile2 outputFile"

Source code is also available (GPL).

Peter Mortensen
1

In Windows, you can use PowerShell's Compare-Object cmdlet:

Compare-Object -IncludeEqual -ExcludeDifferent -PassThru (Get-Content A.txt) (Get-Content B.txt) > MATCHING.txt  # Find matching lines

Compare-Object switches:

  • -IncludeEqual without -ExcludeDifferent: everything
  • -ExcludeDifferent without -IncludeEqual: nothing
Peter Mortensen
Shrike