4

I have 8 files, each containing a single column with a non-uniform number of rows. I need to identify the elements that are common to all 8 files.

I can do this when comparing two files, but I am unable to write a workable one-liner in shell to do the same for all eight.

Any ideas?

Thank you in advance.

File 1
Paul
pawan

File 2
Raman
Paul
sweet
barua

File 3
Sweet
barua
Paul

The result of comparing these three files should be Paul.

Angelo

6 Answers

8

The following one-liner should do the trick (change 3 to 8 to match your case):

$ sort * | uniq -c | grep 3
      3 Paul

Probably better to do this in Python though, using sets...

Fredrik Pihl
  • Returns the same if one line is three times in one file and not at all in the others. – eumiro Jan 02 '12 at 12:28
  • @eumiro, true; **know your input** and select the best method for it. As I said: "python and `set` is probably the best solution" but you posted that one :-) – Fredrik Pihl Jan 02 '12 at 12:34
  • Of course, `grep 3` should be `grep 8` in the case with 8 files. And it should be `grep "^ *8 "` to omit lines that have numbers in them. And an additional `| sed -e "s/^ *8 //"` removes the superfluous count at the beginning of the result. – daniel kullmann Jan 02 '12 at 15:00
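Folding the comments' corrections together, a sketch for the question's three example files (names File1, File2, File3 assumed; replace 3 with 8 for eight files) could be:

```shell
# Count how many times each line occurs across all files, keep lines
# seen exactly 3 times, and strip the leading count that uniq -c adds.
# Caveat from the comments: a line repeated 3 times inside a single
# file also matches; the duplicate-safe answers below avoid that.
sort File1 File2 File3 \
  | uniq -c \
  | awk '$1 == 3 { sub(/^[ \t]*[0-9]+[ \t]+/, ""); print }'
```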
6
python -c 'import sys;print "".join(sorted(set.intersection(*[set(open(a).readlines()) for a in sys.argv[1:]])))' File1 File2 File3

prints Paul for your files File1, File2 and File3.

eumiro
  • Totally ripped my answer I was working on. :D Btw, `"\n".join` should be better, IMO, and no need to sort since the set will be sorted. – st0le Jan 02 '12 at 12:24
  • @st0le - `"\n".join` inserts extra newlines, while `.readlines` keeps them within the strings, so you don't need the extra `"\n"`. And `set` is not automatically sorted. – eumiro Jan 02 '12 at 12:26
  • `import sys;print"".join(reduce(set.intersection, map(set, map(open, sys.argv[1:]))))` – jfs Jan 02 '12 at 13:17
  • How would you edit this so the output would be: Paul 3 Sweet 2 barua 2 Ramen 1, i.e. a list of all the strings and how many files they are common in, possibly sorted, top 10? – Jack Antony Park Jul 08 '20 at 14:00
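The one-liner above uses the Python 2 print statement, so it won't run under Python 3. A Python 3 sketch of the same set-intersection approach (file names File1, File2, File3 assumed from the question):

```shell
# Intersect the per-file sets of lines; readlines keeps each line's
# newline, so no join separator is needed. end="" avoids printing an
# extra trailing blank line.
python3 -c 'import sys; print("".join(sorted(set.intersection(*[set(open(a).readlines()) for a in sys.argv[1:]]))), end="")' File1 File2 File3
```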
4

Perl

$ perl -lnE '$c{$_}{$ARGV}++ }{ print for grep { keys %{$c{$_}} == 8 } keys %c;' file[1-8]

It should be possible to get rid of the hard-coded 8 as well with `@{[ glob "@ARGV" ]}`, but I don't have time to test it now.

This solution will correctly handle the existence of duplicate lines across files as well.

Zaid
3

Here I've been trying to find a concise way to make sure each match comes from a different file. If there are no duplicates within the files, it's fairly simple in Perl:

perl -lnwE '$a{$_}++; END { for (keys %a) { print if $a{$_} == 3 } }' files*

The `-l` option auto-chomps your input (removes the newline) and adds a newline to each `print`. This matters if a file's last line is missing its trailing newline.

The `-n` option wraps the code in a loop that reads input from the file name arguments (or stdin).

The hash increment counts occurrences, and the `END` block prints whatever appeared 3 times. Change 3 to however many files you have.

If you want a slightly more flexible version, you can count the arguments in a BEGIN block.

perl -lnwE 'BEGIN { $n = scalar @ARGV } 
    $a{$_}++; END { for (keys %a) { print if $a{$_} == $n } }' files*
TLP
  • No need for the `BEGIN` block. Just replace `$n` with `@ARGV` or `0+@ARGV` – Zaid Jan 02 '12 at 14:12
  • @TLP: I checked your first one-liner with duplicates in two files as well, and it works perfectly. Or am I missing something? – Angelo May 29 '12 at 15:09
2
$ awk '++a[$0]==3' file{1..3}.txt
Paul

Update

$ awk '(FILENAME SUBSEP $0) in b{next}; b[FILENAME,$0]=1 && ++a[$0]==3' file{1..3}.txt
Paul
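Spelled out, the update's guard counts each distinct line at most once per file; awk's built-in SUBSEP is the subscript separator that the comma form `b[FILENAME,$0]` uses internally. A commented sketch of the same logic (file names and the count 3 assumed from the question):

```shell
awk '
    # Skip lines already counted for the current file.
    (FILENAME SUBSEP $0) in seen { next }
    {
        seen[FILENAME, $0] = 1        # remember (file, line)
        if (++count[$0] == 3)         # line has now appeared in 3 files
            print
    }
' File1 File2 File3
```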
kev
2

This might work for you:

ls file{1..3} | 
xargs -n1 sort -u | 
sort | 
uniq -c | 
sed 's/^\s*'"$(ls file{1..3} | wc -l)"'\s*//p;d'
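The per-file `sort -u` is what makes this duplicate-safe. The same idea without the `ls`/`wc` command substitution, doing the count check in awk instead (a sketch assuming files File1, File2, File3):

```shell
# Deduplicate within each file first, then keep lines whose total
# count equals the number of files (3 here).
for f in File1 File2 File3; do sort -u "$f"; done \
  | sort | uniq -c \
  | awk '$1 == 3 { sub(/^[ \t]*[0-9]+[ \t]+/, ""); print }'
```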
potong