Print only the lines that exist in all the input files

Question

Print only the lines that exist in all four of the given input files. In the input files shown below, only /dev/dev_sg2 and /dev/dev_sg3 exist in every file.

$ cat file1
/dev/dev_sg1
/dev/dev_sg2
/dev/dev_sg3
/dev/dev_sg4

$ cat file2
/dev/dev_sg8
/dev/dev_sg2
/dev/dev_sg3
/dev/dev_sg6

$ cat file3
/dev/dev_sg5
/dev/dev_sg2
/dev/dev_sg3
/dev/dev_sg6

$ cat file4
/dev/dev_sg2
/dev/dev_sg3
/dev/dev_sg1
/dev/dev_sg4

What I tried:

cat file* | sort |uniq -c

      1 /dev/dev_sg1
      4 /dev/dev_sg2
      4 /dev/dev_sg3
      1 /dev/dev_sg4
      1 /dev/dev_sg5
      2 /dev/dev_sg6
      1 /dev/dev_sg8

Possible duplicate of [Finding common value across multiple files containing single column values](https://stackoverflow.com/questions/43472246/finding-common-value-across-multiple-files-containing-single-column-values) — Sundeep, Jan 02 '18 at 07:45

score 1 · Answer 1 · answered Jan 02 '18 at 06:43

With comm pipeline:

comm -12 <(sort file1) <(sort file2) | comm -12 - <(sort file3) | comm -12 - <(sort file4)

-12: suppress column 1 (lines unique to the first input file) and column 2 (lines unique to the second input file), so that only the lines common to both are printed

The output:

/dev/dev_sg2
/dev/dev_sg3
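
The same comm idea can be extended to any number of files by folding the intersection over the argument list. The following is only a minimal sketch, not part of the original answer: the function name common_lines and the temporary files are assumptions made for illustration, and it uses plain temporary files instead of process substitution so that it does not depend on bash.

```shell
# Hypothetical helper: print the lines common to every file given as an argument.
common_lines() {
    tmp=$(mktemp) || return 1
    next=$(mktemp) || return 1
    # Start with the sorted, de-duplicated lines of the first file.
    sort -u "$1" > "$tmp"
    shift
    for f in "$@"; do
        # Keep only the lines that also occur in the next file.
        sort -u "$f" | comm -12 "$tmp" - > "$next"
        mv "$next" "$tmp"
    done
    cat "$tmp"
    rm -f "$tmp" "$next"
}
```

Called as common_lines file1 file2 file3 file4, this should print the same two lines as the pipeline above. The output is in sorted order, not in the order of any input file.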

RomanPerekhrest

RavinderSingh13 · Accepted Answer · 2018-01-02T06:47:51.407

The following awk code may help with this.

awk 'FNR==NR{a[$0];next} ($0 in a){++c[$0]} END{for(i in c){if(c[i]==3){print i,c[i]+1}}}' Input_file1 Input_file2 Input_file3 Input_file4

The output will be as follows.

/dev/dev_sg2 4
/dev/dev_sg3 4

EDIT: In case you don't want the count and simply want to print the lines that occur in all 4 Input_files, the following will do the trick:

awk 'FNR==NR{a[$0];next} ($0 in a){++c[$0]} END{for(i in c){if(c[i]==3){print i}}}'  Input_file1 Input_file2 Input_file3 Input_file4

EDIT2: Adding an explanation of the code.

awk '
FNR==NR{       ##FNR==NR is TRUE only while the very first file, Input_file1, is being read.
 a[$0];        ##Create an entry in array a whose index is the current line $0.
 next          ##next is a built-in awk keyword: skip the remaining statements and move on to the next input line.
}
($0 in a){     ##Reached only after Input_file1 has been read completely: check whether the current line $0 is an index of array a.
 ++c[$0]       ##If so, increment the element of array c whose index is the current line.
}
END{           ##The END block runs after all input has been read.
for(i in c){   ##Loop over the indices of array c.
 if(c[i]==3){  ##A count of 3 means the line from Input_file1 was also seen in the other 3 files, i.e. it appears in all 4 Input_file(s).
   print i     ##Print i, which holds a line that appears in all 4 Input_file(s).
}
}}
' Input_file1 Input_file2 Input_file3 Input_file4 ##All 4 Input_file(s) are listed here.
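
The code above hard-codes the count 3 and assumes that no line is repeated within a single file. A possible variant that avoids both assumptions is sketched below. It is not part of the original answer: the function name common_awk is hypothetical, and the sketch assumes that every argument is a distinct file name (no var=value operands), so that ARGC - 1 equals the number of input files.

```shell
# Hypothetical helper: print the lines present in every file argument.
common_awk() {
    awk '
    # Count each distinct line at most once per file.
    !seen[FILENAME, $0]++ { count[$0]++ }
    END {
        # ARGC - 1 is the number of file operands (assumption: all
        # arguments are file names, each listed once).
        for (line in count)
            if (count[line] == ARGC - 1)
                print line
    }' "$@" | sort
}
```

The trailing sort is there only because awk does not guarantee any particular order for a for (line in count) loop. Usage would be common_awk Input_file1 Input_file2 Input_file3 Input_file4.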

score 0 · Answer 3 · answered Jan 02 '18 at 08:07

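If you know beforehand that there are exactly 4 input files, and that no line is repeated within a single file, you could simply add grep at the end of your existing solution, like this:

cat file* | sort | uniq -c | grep -E '^ *4 '

This shows only the lines whose count is 4, i.e. the lines that occur in all four files. uniq -c pads the count with leading spaces, so the pattern has to allow for them; a bare '^4' would match nothing.

If you need this to work for an arbitrary number of files, a better solution is needed.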
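
One way such a solution could look is sketched below. It keeps the sort | uniq -c idea but de-duplicates each file first and compares the count against the number of files given. This is only a sketch under those assumptions; the function name common_by_count is made up for illustration.

```shell
# Hypothetical helper: print the lines whose per-file occurrence count
# equals the number of files passed as arguments.
common_by_count() {
    n=$#
    for f in "$@"; do
        # De-duplicate within each file so a line counts once per file.
        sort -u "$f"
    done | sort | uniq -c |
        awk -v n="$n" '$1 == n {
            # Strip the padded count that uniq -c puts in front of the line.
            sub(/^[ \t]*[0-9]+[ \t]/, "")
            print
        }'
}
```

For example, common_by_count file* would take the place of the cat file* pipeline, and it prints the matching lines without the leading count.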

score 0 · Answer 4 · answered Jan 02 '18 at 15:50

If the order doesn't need to be maintained:

$ j() { join <(sort $1) <(sort $2); }; j <(j file1 file2) <(j file3 file4)

/dev/dev_sg2
/dev/dev_sg3

karakfa
