
I need to find null-delimited items from numerous files (data2, data3, ...) that are present in data1. Exact match is required.

Everything works well with grep -f data1 data2 data3 ... until the items in data1 are also null-delimited.

  1. Using only newlines - ok:

    $ cat data1
    1234
    abcd
    efgh
    5678
    $ cat data2
    1111
    oooo
    abcd
    5678
    $ grep -xFf data1 data2
    abcd
    5678
    
  2. data2 contains null-delimited items - ok when -z is used:

    $ printf '1111\0oooo\0abcd\0005678' > data2
    $ grep -zxFf data1 data2 | xargs -0 printf '%s\n'
    abcd
    5678
    
  3. Now both data1 and data2 contain null-delimited items - this fails. It seems that the -z option does not apply to the file specified with -f:

    $ printf '1234\0abcd\0efgh\0005678' > data1
    $ grep -zxFf data1 data2 | xargs -0 printf '%s\n'
    
    $
    

The problem is that I do need both files to have null-delimited items. An obvious workaround would be (for example) a good old while loop:

# Read NUL-delimited items from data1; the `|| [[ $line ]]` keeps the
# last item even if the file has no trailing NUL.
while IFS= read -rd '' line || [[ $line ]]; do
    if grep -zqxF "$line" data2; then
        printf '%s\n' "$line"
    fi
done < data1

But since I have many files with lots of items, this will be painfully slow! Is there a better approach (I do not insist on using grep)?

PesaThe
  • Do you need to retain ordering? Would a `comm`-style operation work as well as `grep`? (And do you *have* GNU `comm`, with `-z`?) – Charles Duffy Aug 28 '18 at 15:22
  • @CharlesDuffy Ordering is not a problem at all. – PesaThe Aug 28 '18 at 15:22
  • Are you using `grep` because you really need its regular-expression-matching speed, or because you had a problem that started out simple enough to hack together a `grep`-based solution? If the latter, this is the point where you switch to a language with proper data structures, read `data` into memory, and go from there. – chepner Aug 28 '18 at 15:25

2 Answers


Since order retention isn't important, you're trying to match exact strings, and you have GNU tools available, instead of using fgrep I'd suggest comm -z.

$ printf '%s\0' 1111 oooo abcd 005678 >data2
$ printf '%s\0' 1234 abcd efgh 005678 >data
$ comm -z12 <(sort -uz <data) <(sort -uz <data2) | xargs -0 printf '%s\n'
005678
abcd

If you generate your files sorted in the first place (and thus can leave out the sort operations), this will also have very good memory and performance characteristics.
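
A concrete sketch of that pre-sorted case (the file names are placeholders; data.sorted, data2.sorted, ... are assumed to have been produced once with sort -z):

# comm needs both operands sorted; sort -m only merges the already-sorted
# lookup files, so nothing here re-sorts or buffers whole files in memory.
comm -z12 data.sorted <(sort -muz data2.sorted data3.sorted) | xargs -0 printf '%s\n'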

Charles Duffy
  • That would work (+1) but since I have many files, I'd issue `sort` and `comm` many times. Is there a more efficient solution that comes to mind or do you think this will be fast enough? – PesaThe Aug 28 '18 at 15:28
  • @PesaThe, `comm` is very fast, and you can do the `sort`s once per file, not once per operation, so long as you retain their results. – Charles Duffy Aug 28 '18 at 15:29
  • @PesaThe, ...the other thing is -- what exactly are you trying to accomplish with your many files? If you want to look for items in both file1 and "any of file2, file3 or file4", you can do that trivially with just one `comm` invocation, comparing a sort of file1 with file2, file3 and file4 merged+sorted together. Suggesting a good solution requires knowing more about the problem you're trying to solve. – Charles Duffy Aug 28 '18 at 15:31
  • Oh, I could just use `comm ... <(sort -uz data2 data3 ...)`. I need a break :) Thanks, that sounds fast enough! – PesaThe Aug 28 '18 at 15:33
  • ...and if `data2`, `data3`, etc. are individually pre-sorted, you could use `comm ... <(sort -umz data2 data3 ...)` to join them with a merge sort, which will be even faster (and more memory-efficient). – Charles Duffy Aug 28 '18 at 15:37

(Although the following may not be the best solution for this particular case, I added it anyway in case it helps some future reader with a similar problem. See below for a gawk solution which may be useful for this use case.)

grep has newline hard-wired as the pattern terminator. Even if you use -e pattern, newlines in the pattern string will cause grep to handle the options as specifying multiple patterns, not a single pattern containing newline characters.
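
For example, a single -e argument containing a newline is silently split into two patterns (using bash's $'...' quoting to embed the newline):

$ printf 'b\n' | grep -c -e $'a\nb'
1

The lone b line matches, showing that the argument was treated as the two patterns a and b, not as one pattern containing a newline.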

However, if your NUL-separated patterns do not contain newline characters, you can use Gnu xargs and sed to construct an appropriate grep invocation with -e command-line arguments:

sed -z 's/^/-e/' data | xargs -0 grep -zF data2 ...

(This works because Gnu grep reshuffles command-line arguments, so it's ok to put the files to be searched before the patterns. That won't work on many other grep implementations.)
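
If you would rather not lean on that argument reshuffling, one alternative sketch (same assumption as above: the patterns contain no newlines) is to let xargs hand the -e arguments to a small sh -c wrapper, which then places the files to search after them explicitly:

sed -z 's/^/-e/' data | xargs -0 sh -c 'grep -zF "$@" data2' sh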

As far as I know, there is no workaround for patterns which may contain newline characters. grep -E and grep -F do not recognise ASCII escape sequences, and will silently create multiple patterns from a pattern which contains a newline. grep -P (another Gnu extension which uses PCRE regexen) will properly handle embedded newline characters or ASCII escapes, but only allows a single pattern.


Full-line NUL-terminated matches without sorting

In the case that you are only interested in exact, full "line" matches (-Fx), you could use a Gnu Awk script rather than sorting the inputs and patterns. This could be a win for very large inputs which don't fit in memory; sorting with external temporary files can be quite expensive. The Awk solution uses a hash table, so sorting is unnecessary. (Again, this might not work on all Awks because it relies on setting RS to NUL.)

awk -v RS='\0' 'NR==FNR{p[$0] = 1; next;} $0 in p' data data2 ...
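
As a quick sanity check (assuming a recent gawk, and reusing the question's sample items, written here to data and data2 to match the command above):

$ printf '1234\0abcd\0efgh\0005678' > data
$ printf '1111\0oooo\0abcd\0005678' > data2
$ gawk -v RS='\0' 'NR==FNR{p[$0] = 1; next;} $0 in p' data data2
abcd
5678

The two NUL-delimited records common to both files come out newline-separated because ORS is still the default newline.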
rici
  • Interesting, so newline is unfortunately hard-wired. My "patterns" can contain newlines, hence the null delimiter. Anyway, it was a nice read, you probably meant `< <(sed...)` though. Thanks! – PesaThe Aug 28 '18 at 18:29
  • @PesaThe: Oops, that was an artefact from a previous attempt. I meant to just do a simple pipe. Fixed, thanks. But if your patterns can contain newlines, you need a completely different solution; I think `grep` just cannot help you. – rici Aug 28 '18 at 18:30
  • I am not insisting on `grep`. It just looked like a good tool for the job at the time which now I know it is not (thanks to comments/answers). Since the "patterns" are fixed (`-Fx`), the accepted solution works perfectly. Nonetheless, this answer will definitely benefit future readers! :) – PesaThe Aug 28 '18 at 18:34
  • I've just seen your update and I have to confess I'm a little embarrassed for trying to find some complicated `grep` solution when for fixed strings I could've used the very simple and intuitive `awk -v RS='\0'` ... – PesaThe Aug 28 '18 at 22:44
  • I am also split now on whether I should change the accepted answer or not. Both answers are brilliant, your answer providing more information however... – PesaThe Aug 28 '18 at 22:48
  • @PesaThe: Whatever. Neither Charles nor I are hurting for rep :). The grep solution will have its uses, but not in the case where full-line matching is actually required; I confess, I didn't see the `-x` option in your command lines until just now. – rici Aug 28 '18 at 22:49
  • For awk, if the data contains a large number of values (lines) that don't match the 'good' list, you can reduce memory bloat (and maybe overflow) by testing `$0 in p` instead of `p[$0]/*!=0*/` – dave_thompson_085 Aug 28 '18 at 23:02
  • @dave: Good point. Fixed. Thanks. (If I weren't in the middle of something I'd fix `NR==FNR`, too, but it's such an idiom... Later.) – rici Aug 28 '18 at 23:47