Percentage of matching lines across multiple files

Question

I need to find common lines across multiple files: more than 100 files with millions of lines each. This is similar to: Shell: Find Matching Lines Across Many Files.

However, I would like to find not only shared lines across all files but also those lines that are found in all files except one, all files except two and so on. I am interested in using percentages to do so. For example, which entries show up in 90% of the files, 80%, 70% and so on. As an example:

File1

lineA
lineB
lineC

File2

lineB
lineC
lineD

File3

lineC
lineE
lineF

Hypothetical output for the sake of demonstration:

<lineC> is found in 3 out of 3 files (100.00%)

<lineB> is found in 2 out of 3 files (66.67%)

<lineF> is found in 1 out of 3 files (33.33%)

Does anyone know how to do it?

Thank you very much!

Please, show some example file contents of multiple files with expected output. — James Brown, Feb 22 '18 at 16:53
Please, read how to [ask good questions](https://stackoverflow.com/help/mcve). — LMC, Feb 22 '18 at 17:09
What you are showing won't do what you ask. You mentioned 1 million lines of text, so I imagine you won't need to show the stats for all the lines. Also, at that scale of input, a naive algorithm would be very slow. Are you asking for practical algorithms as well? — Jason Hu, Feb 22 '18 at 17:38
Why show a series of lines as the desired output and then say "I would prefer a table" instead of simply showing a table as the desired output? — Ed Morton, Feb 23 '18 at 12:29
Thanks for your comments, I will try to be more clear in the future. The answer shown below is working. The speed of the algorithm seems quite reasonable according to my data. — Santiago Montero-Mendieta, Feb 23 '18 at 13:49

score 2 · Answer 1 · answered Feb 22 '18 at 17:32

With GNU awk (4.0 or later) for its true multidimensional arrays (arrays of arrays):

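gawk '
    BEGIN { nfiles = ARGC - 1 }            # number of input files on the command line
    { lines[$0][FILENAME] = 1 }            # record each file in which this line occurs
    END {
        for (line in lines) {
            n = length(lines[line])        # number of distinct files containing the line
            printf "<%s> is found in %d of %d files (%.2f%%)\n", line, n, nfiles, 100*n/nfiles
        }
    }
' file{1,2,3}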

<lineA> is found in 1 of 3 files (33.33%)
<lineB> is found in 2 of 3 files (66.67%)
<lineC> is found in 3 of 3 files (100.00%)
<lineD> is found in 1 of 3 files (33.33%)
<lineE> is found in 1 of 3 files (33.33%)
<lineF> is found in 1 of 3 files (33.33%)

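The order of output is indeterminate, because the for (line in lines) loop visits array elements in an unspecified order. Pipe the output through sort if you need a stable order.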
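
Since the question also asks which entries show up in at least 90%, 80%, 70% of the files, here is a minimal sketch of a variant that filters by a minimum percentage and prints a sorted, tab-separated table. It is not the code above but a swapped-in technique: it avoids arrays of arrays (so it should run in any POSIX awk) by remembering, for each line, the index of the last file it was seen in. The function name line_share and the variable names fidx, last and count are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: line_share MIN_PERCENT FILE...
# Prints "count<TAB>percent<TAB>line" for every line that occurs in at
# least MIN_PERCENT of the given files, most widely shared lines first.
line_share() {
    min=$1
    shift
    awk -v min="$min" '
        BEGIN { nfiles = ARGC - 1 }        # number of files on the command line
        FNR == 1 { fidx++ }                # a new (non-empty) input file starts
        last[$0] != fidx {                 # first occurrence of this line in the current file
            last[$0] = fidx
            count[$0]++                    # one more file contains this line
        }
        END {
            for (line in count)
                # compare without dividing, so the threshold test is exact
                if (100 * count[line] >= min * nfiles)
                    printf "%d\t%.2f\t%s\n", count[line], 100 * count[line] / nfiles, line
        }
    ' "$@" | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k3
}
```

For example, line_share 60 file1 file2 file3 would list only the lines present in at least 60% of the three files, and line_share 0 would list every line. This keeps two entries per distinct line rather than one per line and file, which may help with memory on 100+ large files, but that has not been measured here.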

Thanks, Glenn! This works fine and it's fast enough for my needs. — Santiago Montero-Mendieta, Feb 23 '18 at 13:43
