9

Say, I have two files and want to find out how many equal lines they have. For example, file1 is

1
3
2
4
5
0
10

and file2 contains

3
10
5
64
15

In this case the answer should be 3 (common lines are '3', '10' and '5').

This is, of course, quite simple to do in Python, for example, but I got curious about doing it from bash (with standard utils or extra things like awk). This is what I came up with:

 cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l

It seems too complicated for the task, so I'm wondering whether there is a simpler or more elegant way to achieve the same result.

P.S. Outputting the percentage of common lines relative to the number of lines in each file would also be nice, though it is not necessary.

UPD: Files do not have duplicate lines

mikhail

9 Answers

15

To find the lines common to your two files using awk:

awk 'a[$0]++' file1 file2

Will output:

3
10
5

Now, just pipe this to wc to get the number of common lines:

awk 'a[$0]++' file1 file2 | wc -l

Will output 3.

Explanation:

Here, a works like a dictionary with a default value of 0. When you write a[$0]++, you add 1 to a[$0], but the expression returns the previous value of a[$0] (see the difference between a++ and ++a). So you get 0 (= false) the first time you encounter a given string and 1 (or more, still = true) on every subsequent encounter.

By default, awk 'condition' file prints all the lines for which condition is true.

Also be aware that the a[] array grows every time a new key is encountered. At the end of the script, its size will be the number of unique values across all the input files (9 in the OP's example).
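
As a quick illustration of the post- versus pre-increment behaviour described above (just a throwaway demo, unrelated to the files themselves):

awk 'BEGIN { print x++; print x; print ++y; print y }'

prints 0, 1, 1 and 1: x++ yields the old value (0, i.e. false), while ++y yields the new value (1, i.e. true).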


Note: this solution counts duplicates, i.e. if you have:

file1 | file2
1     | 3
2     | 3
3     | 3

awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3

If this is behaviour you don't want, you can use the following code to filter out the duplicates:

awk '++a[$0] == 2' file1 file2 | wc -l
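
A quick sanity check of that filter, using the duplicate example above with two hypothetical throwaway files dup1 and dup2 holding the two columns of the table:

printf '1\n2\n3\n' > dup1
printf '3\n3\n3\n' > dup2
awk '++a[$0] == 2' dup1 dup2 | wc -l

outputs 1, since the repeated '3' is only counted once (the first time its total count reaches 2).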
Aserre
2

With your input example, this works too, but if the files are huge I would prefer the awk solutions given by others:

grep -cFwf file2 file1

With your input files, the above line outputs

3
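
For reference, here is what the combined flags do (paraphrased from the grep man page):

-c          print a count of matching lines instead of the lines themselves
-F          treat the patterns as fixed strings, not regular expressions
-w          match whole words only, so the pattern 3 does not match a line like 31
-f file2    read the patterns, one per line, from file2

If your lines may contain spaces, -x (match whole lines) is a stricter alternative to -w.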
Kent
  • Not that intuitive, unfortunately :) I would have to dig up the flags every time I need this command. – mikhail Aug 13 '14 at 10:33
  • @mikhail if you use grep often, you would know `-Fwf` is very common combination. and `-c` is just for printing counts. but it doesn't matter. I think you have enough solution options for your problem. – Kent Aug 13 '14 at 11:24
1

Here's one without awk that instead uses comm:

comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l

comm compares two sorted files. The options -1 and -2 suppress the lines unique to the first and second file respectively, so what remains is the lines they have in common, one per line. wc -l counts the number of lines.

Output without wc -l:

10
3
5

And when counting (obviously):

3
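
If you also want the percentages from the P.S., a rough sketch on top of the same comm pipeline (assuming the files are named file1.txt and file2.txt as above, and using plain bash integer arithmetic):

common=$(comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l)
echo "file1: $(( 100 * common / $(wc -l < file1.txt) ))%"
echo "file2: $(( 100 * common / $(wc -l < file2.txt) ))%"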
keyser
  • Nice, but this will first sort the files, which isn't that good for big files. – mikhail Aug 13 '14 at 10:22
  • 1
    Although I do like that you can edit the flags to get, for example, all lines that are in first file, but not in second. Thanks for your answer. – mikhail Aug 13 '14 at 10:26
0

You can do it all with awk:

awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2

To get the percentage, something like this works:

awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++;  }  print b, c/n; print FILENAME, c/FNR;}' file1 file2

and outputs

file1 0.428571
file2 0.6
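
In case the dense one-liner is hard to follow, here is the same program spelled out with comments; it should behave identically:

awk '
    { a[$0] += 1 }                      # count every line across both files
    NR == FNR { b = FILENAME; n = NR }  # remember the name and line count of the first file
    END {
        c = 0
        for (i in a) if (a[i] > 1) c++  # lines seen more than once are the common ones
        print b, c / n                  # fraction of file1 that is common
        print FILENAME, c / FNR         # fraction of file2 (FNR still holds its line count)
    }
' file1 file2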

In your solution, you can get rid of the cat:

sort file1 file2 | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
martin
0

You can also use the comm command. Remember that you have to sort the files you want to compare first:

[gc@slave ~]$ sort a > sorted_1
[gc@slave ~]$ sort b > sorted_2
[gc@slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5

From the man page for comm (compare two sorted files line by line), the relevant options are:

-1     suppress column 1 (lines unique to FILE1)
-2     suppress column 2 (lines unique to FILE2)
-3     suppress column 3 (lines that appear in both files)
Technext
0

How about keeping it nice and simple...

This is all that's needed:

cat file1 file2 | sort -n | uniq -d | wc -l
3

man sort: -n, --numeric-sort -- compare according to string numerical value

man uniq: -d, --repeated -- only print duplicate lines

man wc: -l, --lines -- print the newline counts

Hope this helps.

EDIT - one fewer process (credit martin):

sort file1 file2 | uniq -d | wc -l
mattst
  • Files do not necessarily contain only numbers. Thanks for `-d` option, didn't know about it, but otherwise this is not that different from what I originally suggested and still makes use of 4 programs. – mikhail Aug 13 '14 at 10:30
  • It doesn't matter about whether the lines contain only numbers or not `sort -n` will work anyway. In fact the `-n` is superfluous, I put it in because your example files show only numbers. The purpose of my answer was just to show that you could get rid of the awk code by using `uniq -d` instead of `uniq -c`. My edit above now uses one fewer program. – mattst Aug 13 '14 at 11:41
0

One way using awk:

awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2

Output:

3
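
Written out with whitespace and comments, in case the one-liner is opaque (the behaviour should be unchanged):

awk '
    NR == FNR { a[$0]; next }   # first file: record each line as an array key
    $0 in a   { n++ }           # second file: count lines already seen in the first file
    END       { print n }
' file1 file2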
John B
0

The first answer by Aserre using awk is good, but it may have the undesirable effect of counting duplicates, even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.

I believe this edit will return only the unique lines that exist in BOTH files.

awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2

If duplicates are desired, but only when they exist in both files, I believe this next version will work, though it will only report duplicates in the second file that also exist in the first file. (If the duplicates exist in the first file, only those that also exist in file2 will be reported, so file order matters.)

awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
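
Both versions print the matching lines rather than a count; as in the other answers, piping to wc -l gives the number the question asks for:

awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2 | wc -l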

Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!

greg
0

UPDATE 1: the new version ensures intra-file duplicates are excluded from the count, so only cross-file duplicates show up in the final stats:

 mawk '
 BEGIN { _*= FS = "^$"
 }    FNR == NF { split("",___)
 } ___[$_]++<NF {          __[$_]++ 
 } END {          split("",___)
     for (_ in __) {
               ___[__[_]]++ }    printf(RS)
     for (_ in ___) {
         printf(" %\04715.f %s\n",_,___[_]) }
                                 printf(RS) }' \
    <( jot - 1  999  3 | mawk '1;1;1;1;1' | shuf ) \
  <( jot - 2 1024  7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )

           3 3
           2 67
           1 413

===========================================

This is probably way overkill, but I wrote something similar to this to supplement uniq -c:

measuring the frequency of frequencies 

It's like uniq -c | uniq -c without wasting time on sorting. The summation and % parts are trivial from here; there are 47 overlapping lines in this example. It avoids per-row output entirely, since the current setup only shows the summarized stats.

If you need the actual duplicated rows, they're also available right there, serving as the hash keys of the 1st array.
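
For comparison, the same frequency-of-frequencies idea can be sketched with the usual sort/uniq toolchain (this is exactly the sorting work the mawk script avoids):

sort file1 file2 | uniq -c | awk '{ print $1 }' | sort -n | uniq -c

In that output the second column is the multiplicity (2 = present in both files, given no intra-file duplicates) and the first column is how many distinct lines have that multiplicity.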

 gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) | 

 mawk '
 BEGIN { _*= FS = "^$"
 }     { __[$_]++
 } END {                            printf(RS)
     for (_ in __) { ___[__[_]]++ }
     for (_ in ___) {
         printf(" %\04715.f %s\n",
                        _,___[_]) } printf(RS) }' 
  2 47
  1 386

Add another file, and the results reflect the change (I added <( jot - 5 1295 5 )):

  3 9
  2 115
  1 482
RARE Kpop Manifesto