4

I have 8 files, each containing a single column with a non-uniform number of rows. I need to identify the elements that are common to all 8 files.

I can do this when comparing two files, but I am unable to write a workable one-liner in shell to do the same for all eight.

Any ideas?

Thank you in advance.

File 1
Paul
pawan

File 2
Raman
Paul
sweet
barua

File 3
Sweet
barua
Paul

The result of comparing these three files should be Paul.

Angelo

6 Answers

8

The following one-liner should do the trick (change 3 to 8 to match your case):

$ sort * | uniq -c | grep 3
      3 Paul

Probably better to do this in Python though, using sets...

Fredrik Pihl
  • Returns the same if one line is three times in one file and not at all in the others. – eumiro Jan 02 '12 at 12:28
  • @eumiro, true; **know your input** and select the best method for it. As I said: "python and `set` is probably the best solution" but you posted that one :-) – Fredrik Pihl Jan 02 '12 at 12:34
  • Of course, `grep 3` should be `grep 8` in the case with 8 files. And it should be `grep "^ *8 "` to omit lines that have numbers in them. And an additional `| sed -e "s/^ *8 //"` removes the superfluous count at the beginning of the result. – daniel kullmann Jan 02 '12 at 15:00
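Folding the comments' corrections together, a sketch for the question's three example files (names File1, File2, File3 assumed; replace 3 with 8 for eight files) could be:

```shell
# Count how many times each line occurs across all files, keep lines
# seen exactly 3 times, and strip the leading count that uniq -c adds.
# Caveat from the comments: a line repeated 3 times inside a single
# file also matches; the duplicate-safe answers below avoid that.
sort File1 File2 File3 \
  | uniq -c \
  | awk '$1 == 3 { sub(/^[ \t]*[0-9]+[ \t]+/, ""); print }'
```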
6
python -c 'import sys;print "".join(sorted(set.intersection(*[set(open(a).readlines()) for a in sys.argv[1:]])))' File1 File2 File3

prints Paul for your files File1, File2 and File3.

eumiro
  • Totally ripped my answer I was working on. :D Btw, `"\n".join` should be better, IMO, and no need to sort since the set will be sorted. – st0le Jan 02 '12 at 12:24
  • @st0le - `"\n".join` inserts extra newlines, while `.readlines` keeps them within the strings, so you don't need the extra `"\n"`. And `set` is not automatically sorted. – eumiro Jan 02 '12 at 12:26
  • `import sys;print"".join(reduce(set.intersection, map(set, map(open, sys.argv[1:]))))` – jfs Jan 02 '12 at 13:17
  • How would you edit this so the output would be: Paul 3 Sweet 2 barua 2 Ramen 1, i.e. a list of all the strings and how many files they are common in, possibly sorted, top 10? – Jack Antony Park Jul 08 '20 at 14:00
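The one-liner above uses the Python 2 print statement, so it won't run under Python 3. A Python 3 sketch of the same set-intersection approach (file names File1, File2, File3 assumed from the question):

```shell
# Intersect the per-file sets of lines; readlines keeps each line's
# newline, so no join separator is needed. end="" avoids printing an
# extra trailing blank line.
python3 -c 'import sys; print("".join(sorted(set.intersection(*[set(open(a).readlines()) for a in sys.argv[1:]]))), end="")' File1 File2 File3
```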
4

Perl

$ perl -lnE '$c{$_}{$ARGV}++ }{ print for grep { keys %{$c{$_}} == 8 } keys %c;' file[1-8]

It should be possible to get rid of the hard-coded 8 as well with `@{[ glob "@ARGV" ]}`, but I don't have time to test it now.

This solution will correctly handle the existence of duplicate lines across files as well.

Zaid
3

Here I've been trying to find a concise way to make sure each match comes from a different file. If there are no duplicates within the files, it's fairly simple in Perl:

perl -lnwE '$a{$_}++; END { for (keys %a) { print if $a{$_} == 3 } }' files*

The `-l` option auto-chomps your input (removes the newline) and adds a newline to each `print`. This matters if a file's last line is missing its trailing newline.

The `-n` option wraps the code in a loop that reads input from the file name arguments (or stdin).

The hash increment counts occurrences, and the `END` block prints whatever appeared 3 times. Change 3 to however many files you have.

If you want a slightly more flexible version, you can count the arguments in a BEGIN block.

perl -lnwE 'BEGIN { $n = scalar @ARGV } 
    $a{$_}++; END { for (keys %a) { print if $a{$_} == $n } }' files*
TLP
  • No need for the `BEGIN` block. Just replace `$n` with `@ARGV` or `0+@ARGV` – Zaid Jan 02 '12 at 14:12
  • @TLP: I checked your first one-liner with duplicates in two files as well, and it works perfectly. Or am I missing something? – Angelo May 29 '12 at 15:09
2
$ awk '++a[$0]==3' file{1..3}.txt
Paul

Update

$ awk '(FILENAME SUBSEP $0) in b{next}; b[FILENAME,$0]=1 && ++a[$0]==3' file{1..3}.txt
Paul
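Spelled out, the update's guard counts each distinct line at most once per file; awk's built-in SUBSEP is the subscript separator that the comma form `b[FILENAME,$0]` uses internally. A commented sketch of the same logic (file names and the count 3 assumed from the question):

```shell
awk '
    # Skip lines already counted for the current file.
    (FILENAME SUBSEP $0) in seen { next }
    {
        seen[FILENAME, $0] = 1        # remember (file, line)
        if (++count[$0] == 3)         # line has now appeared in 3 files
            print
    }
' File1 File2 File3
```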
kev
2

This might work for you:

ls file{1..3} | 
xargs -n1 sort -u | 
sort | 
uniq -c | 
sed 's/^\s*'"$(ls file{1..3} | wc -l)"'\s*//p;d'
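The per-file `sort -u` is what makes this duplicate-safe. The same idea without the `ls`/`wc` command substitution, doing the count check in awk instead (a sketch assuming files File1, File2, File3):

```shell
# Deduplicate within each file first, then keep lines whose total
# count equals the number of files (3 here).
for f in File1 File2 File3; do sort -u "$f"; done \
  | sort | uniq -c \
  | awk '$1 == 3 { sub(/^[ \t]*[0-9]+[ \t]+/, ""); print }'
```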
potong