230

I'm sure I once found a shell command which could print the common lines from two or more files. What is its name?

It was much simpler than diff.

miken32
too much php
  • The answers to this question aren't necessarily what everyone will want, since `comm` requires sorted input files. If you want just line-by-line common, it's great. But if you want what I would call "anti-diff", `comm` doesn't do the job. – Robert P. Goldman Apr 20 '12 at 14:15
  • @RobertP.Goldman is there a way to get common between two files when file1 contains partial pattern like `pr-123-xy-45` and file2 contains `ec11_orop_pr-123-xy-45.gz` . I need file3 containing `ec11_orop_pr-123-xy-45.gz` – Chandan Choudhury Nov 02 '15 at 07:20
  • [See this](https://stackoverflow.com/questions/29244351) for sorting text-files line-by-line – y2k-shubham Jul 25 '18 at 07:29

12 Answers

284

The command you are seeking is `comm`. For example:

comm -12 1.sorted.txt 2.sorted.txt

Here:

-1 : suppress column 1 (lines unique to 1.sorted.txt)

-2 : suppress column 2 (lines unique to 2.sorted.txt)
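
For illustration, here is the full three-column output on a pair of made-up files (column two is indented by one tab and column three by two; tabs are shown here as spaces):

$ printf 'apple\nbanana\ncherry\n' > 1.sorted.txt
$ printf 'banana\ncherry\ndate\n' > 2.sorted.txt
$ comm 1.sorted.txt 2.sorted.txt
apple
        banana
        cherry
    date
$ comm -12 1.sorted.txt 2.sorted.txt
banana
cherry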

Mohammed H
Jonathan Leffler
  • Typical usage : comm -12 1.sorted.txt 2.sorted.txt – Fedir RYKHTIK Jun 11 '13 at 15:54
  • While comm needs sorted files, you may take grep -f file1 file2 to get the common lines of both files. – ferdy Jan 20 '15 at 17:29
  • @ferdy (Repeating my comment from your answer, as yours is essentially a repeated answer posted as a comment) `grep` does some weird things you might not expect. Specifically, everything in `1.txt` will be interpreted as a regular expression and not a plain string. Also, any blank line in `1.txt` will match all lines in `2.txt`. So `grep` will only work in very specific situations. You'd at least want to use `fgrep` (or `grep -F`) but the blank-line thing is probably going to wreak havoc on this process. – Christopher Schultz Jul 22 '15 at 14:08
  • See [ferdy](http://stackoverflow.com/users/61903/ferdy)'s [answer](http://stackoverflow.com/a/28051421/15168) below, and [Christopher Schultz](http://stackoverflow.com/users/276232/christopher-schultz)'s and my comments on it. TL;DR — use `grep -F -x -f file1 file2`. – Jonathan Leffler Jul 22 '15 at 14:31
  • @JonathanLeffler How would one have the outputs in different files? – bapors Sep 20 '17 at 09:47
  • @bapors: I'm not sure what you are asking. If you want the lines only in File1 in one file, those only in File2 in another, and those in both in a third, then (provided that none of the lines in the files starts with a tab) you could use `sed` to split the output into three files. But is that what you're asking? – Jonathan Leffler Sep 20 '17 at 13:24
  • @JonathanLeffler yes, it is exactly what I was asking. I am not very confident in `sed`, would you show an example if it is okay? – bapors Sep 21 '17 at 01:27
  • @bapors: I've provided a self-answered Q&A as [How to get the output from the `comm` command into 3 separate files?](https://stackoverflow.com/questions/46336404/how-to-get-the-output-from-the-comm-command-into-3-separate-files/46336405#46336405) The answer was much too big to fit comfortably here. – Jonathan Leffler Sep 21 '17 at 05:56
  • Does it require the files have same number of lines? – Hi-Angel Jan 15 '18 at 21:01
  • @Hi-Angel — no, the files can be different sizes. – Jonathan Leffler Jan 15 '18 at 21:37
72

To easily apply the comm command to unsorted files, use Bash's process substitution:

$ bash --version
GNU bash, version 3.2.51(1)-release
Copyright (C) 2007 Free Software Foundation, Inc.
$ cat > abc
123
567
132
$ cat > def
132
777
321

So the files abc and def have one line in common, the one with "132". Using comm on unsorted files:

$ comm abc def
123
    132
567
132
    777
    321
$ comm -12 abc def # No output! The common line is not found
$

The last command produced no output; the common line was not found.

Now use comm on sorted files, sorting the files with process substitution:

$ comm <( sort abc ) <( sort def )
123
            132
    321
567
    777
$ comm -12 <( sort abc ) <( sort def )
132

Now we get the 132 line!

Jonathan Leffler
Stephan Wehner
  • so... `sort abc > abc.sorted`, `sort def > def.sorted` and then `comm -12 abc.sorted def.sorted`? – Nikana Reklawyks Nov 01 '17 at 01:28
  • @NikanaReklawyks And then remember to remove the temporary files afterwards, and cope with cleaning up in case of an error. In many scenarios, the process substitution will also be a lot quicker because you can avoid the disk I/O as long as the results fit into memory. – tripleee Dec 08 '17 at 05:41
39

To complement the Perl one-liner, here's its awk equivalent:

awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2

This will read all lines from file1 into the array arr[], and then check, for each line in file2, whether it already exists within the array (i.e., whether it occurred in file1). The lines that are found will be printed in the order in which they appear in file2. Note that the comparison in arr uses the entire line from file2 as the index into the array, so it will only report exact matches on entire lines.
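
If file2 can contain the same common line several times, this one-liner prints it once per occurrence. Should you want each common line reported only once (at its first occurrence in file2), a small variant of the same idea does it (a sketch):

awk 'NR==FNR{arr[$0];next} ($0 in arr) && !seen[$0]++' file1 file2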

Jonathan Leffler
Tatjana Heuser
  • THIS(!) is the correct answer. None of the others can be made to work generally (I haven't tried the `perl` ones, because). Thanks a million, Ms. – entonio May 30 '16 at 09:48
  • Preserving the order when displaying the common lines can be really useful in some cases that would exclude comm because of that. – tuxayo Jul 13 '16 at 13:07
  • In case anybody wants to do the same thing based on a certain column but doesn't know awk, just replace both $0's with $5's for example for column 5 so you get lines shared in 2 files with same words in column 5 – FatihSarigol Jan 31 '19 at 15:15
25

Maybe you mean `comm`?

Compare sorted files FILE1 and FILE2 line by line.

With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.

The secret to finding this information is the info pages. For GNU programs, they are much more detailed than the man pages. Try `info coreutils` and it will list all the small useful utilities.
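
For example, assuming the GNU info documentation is installed on your system, this should jump straight to the relevant node:

info coreutils 'comm invocation'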

Johannes Schaub - litb
21

While

fgrep -v -f 1.txt 2.txt > 3.txt

gives you the differences between the two files (what is in 2.txt but not in 1.txt), you can just as easily run

fgrep -f 1.txt 2.txt > 3.txt

to collect all the common lines, which should provide an easy solution to your problem. If you have sorted files, use comm nonetheless. Regards!

Note: You can use grep -F instead of fgrep.
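
Following up on the comments below: grep -f treats every line of 1.txt as a regular expression and matches substrings, and a blank line in 1.txt matches everything. If your grep supports the POSIX -F and -x options (most modern ones do), a safer variant that matches whole lines as fixed strings is:

grep -F -x -f 1.txt 2.txt > 3.txt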

haridsv
ferdy
  • `grep` does some weird things you might not expect. Specifically, everything in `1.txt` will be interpreted as a regular expression and not a plain string. Also, any blank line in `1.txt` will match all lines in `2.txt`. So this will only work in very specific situations. – Christopher Schultz Jul 22 '15 at 14:05
  • @ChristopherSchultz: It's possible to upgrade this answer to work better using POSIX [`grep`](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html) notations, which are supported by the `grep` found on most modern Unix variants. Add `-F` (or use `fgrep`) to suppress regular expressions. Add `-x` (for exact) to match only whole lines. – Jonathan Leffler Jul 22 '15 at 14:20
  • Why should we take `comm` for sorted files? – Ulysse BN Apr 24 '17 at 03:23
  • @UlysseBN `comm` can work with arbitrarily large files as long as they are sorted because it only ever needs to hold three lines in memory (I'm guessing GNU `comm` would even know to keep just a prefix if the lines are really long). The `grep` solution needs to keep all the search expressions in memory. – tripleee Dec 08 '17 at 05:44
13

If the two files are not sorted yet, you can use:

comm -12 <(sort a.txt) <(sort b.txt)

and it will work, avoiding the error message comm: file 2 is not in sorted order when doing comm -12 a.txt b.txt.

Basj
  • You're right, but this is essentially repeating another [answer](https://stackoverflow.com/a/24851202/), which really doesn't provide any benefit. If you decide to answer an older question that has well established and correct answers, adding a new answer late in the day may not get you any credit. If you have some distinctive new information, or you're convinced the other answers are all wrong, by all means add a new answer, but 'yet another answer' giving the same basic information a long time after the question was asked usually won't earn you much credit. – Jonathan Leffler Sep 21 '17 at 06:47
  • I didn't even see this answer @JonathanLeffler because this part was at the very end of the answer, mixed with other elements of answer before. While the other answer is more precise, the benefit of mine I think is that for someone who wants for a quick solution will only have 2 lines to read. Sometimes we're looking for detailed answer and sometimes we are in a hurry and a quick-to-read ready-to-paste answer is fine. – Basj Sep 21 '17 at 10:28
  • Also I don't care about credit / rep, I didn't post for this purpose. – Basj Sep 21 '17 at 10:35
  • Notice also that the process substitution syntax `<(command)` is not portable to POSIX shell, though it works in Bash and some others. – tripleee Dec 08 '17 at 05:37
10
perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/'  file1 file2
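
How it works: while Perl reads file1, @ARGV still holds one remaining argument, so ($seen{$_} .= @ARGV) appends "1" to the line's entry; while reading file2 it appends "0". An entry matching /10$/ means "seen in file1, now encountered in file2 for the first time", so each common line is printed exactly once, in file2 order.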
user2592005
  • this is working better than the `comm` command as it searches each line of `file1` in `file2` where `comm` will only compare if line `n` in `file1` is equal to line `n` in `file2`. – teriiehina Oct 11 '14 at 12:32
  • @teriiehina: No; `comm` does not simply compare line N in file1 with line N in file2. It can perfectly well manage a series of lines inserted in either file (which is equivalent to deleting a series of lines from the other file, of course). It merely requires the inputs to be in sorted order. – Jonathan Leffler Jul 22 '15 at 14:24
  • Better than the `comm` answers if one wants to keep the order. Better than the `awk` answer if one doesn't want duplicates. – tuxayo Jul 13 '16 at 13:16
  • An explanation is here: https://stackoverflow.com/questions/17552789/explain-this-perl-code-which-displays-common-lines-in-2-files – Chris Koknat Aug 25 '17 at 23:18
6
awk 'NR==FNR{a[$1]++; next} a[$1]' file1 file2
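
Note that unlike the awk answer above, this one keys the array on $1, so it compares only the first whitespace-separated field of each line rather than the whole line; replace both occurrences of $1 with $0 to match entire lines.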
R S John
3

On a limited version of Linux (like the QNAP NAS I was working on):

  • comm did not exist
  • grep -f file1 file2 can cause some problems, as said by @ChristopherSchultz, and using grep -F -f file1 file2 was really slow (more than 5 minutes; I did not let it finish, versus 2-3 seconds with the method below) on files over 20 MB

So here is what I did:

sort file1 > file1.sorted
sort file2 > file2.sorted

diff file1.sorted file2.sorted | grep '^<' | sed 's/^< //' > files.diff          # lines only in file1
diff file1.sorted files.diff | grep '^<' | sed 's/^< //' > files.same.sorted     # lines of file1 not in files.diff, i.e. the common lines

If files.same.sorted should be in the same order as the original files, then add this line for the same order as file1:

awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file1 > files.same

Or, for the same order as file2:

awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file2 > files.same
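
(Since the stored value is just the line itself, a shorter equivalent should work too: awk 'FNR==NR {a[$0]; next}; $0 in a' files.same.sorted file1, where the default action prints the matching line.)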
Peter Mortensen
Master DJon
2

For how to do this for multiple files, see the linked answer to Finding matching lines across many files.


Combining these two answers (answer 1 and answer 2), I think you can get the result you need without sorting the files:

#!/bin/bash
ans="matching_lines"

for file1 in *
do
    for file2 in *
    do
        if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ]; then
            echo "Comparing: $file1 $file2 ..." >> "$ans"
            perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
        fi
    done
done

Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It will take all the files in the current working directory and make an all-vs-all comparison, leaving the result in the "matching_lines" file.

Things to be improved:

  • Skip directories
  • Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
  • Maybe add the line number next to the matching string
Peter Mortensen
akarpovsky
0

This is not exactly what you were asking, but it may still be useful to cover a slightly different scenario.

If you just want to quickly check whether there is any repeated line across a bunch of files, you can use this quick solution:

cat a_bunch_of_files* | sort | uniq | wc -l

If the number of lines you get is less than the one you get from

cat a_bunch_of_files* | wc -l

then there is some repeated line (possibly a duplicate within a single file, rather than a line shared between files).
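
As an aside, sort | uniq can be shortened to sort -u here, since only the line count matters.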

-2
rm file3.out

cat file1.out | while read line1
do
        cat file2.out | while read line2
        do
                if [[ $line1 == $line2 ]]; then
                        echo $line1 >>file3.out
                fi
        done
done

This should do it.

Jonathan Leffler
  • You should probably use `rm -f file3.out` if you're going to delete the file; that won't report any error if the file doesn't exist. OTOH, it would not be necessary if your script simply echoed to standard output, letting the user of the script choose where the output should go. Ultimately, you'd probably want to use `$1` and `$2` (command line arguments) instead of fixed file names (`file1.out` and `file2.out`). That leaves the algorithm: it is going to be slow. It is going to read `file2.out` once for each line in `file1.out`. It'll be slow if the files are big (say multiple kilobytes). – Jonathan Leffler Jul 22 '15 at 14:42
  • While this can nominally work if you have inputs which don't contain any shell metacharacters (hint: see what warnings you get from http://shellcheck.net/), this naive approach is terribly inefficient. A tool like `grep -F` which reads one file into memory and then does a single pass over the other avoids looping repeatedly over both input files. – tripleee Dec 08 '17 at 05:40