
file.xml is a large 74G file that I have to grep for a single regular expression as fast as possible. I'm trying to do this using GNU parallel:

parallel --pipe --block 10M --ungroup LC_ALL=C grep -iF "test.*pattern" < file.xml
  1. How can I implement this using `--pipepart`, since it's faster than `--pipe`?

  2. Does it get faster if I increase or decrease the block size (e.g. 20M instead of 10M, or 10M instead of 20M)?

  • You'll have to test to be sure, but my expectation is that trying to parallelize it like this will *not* work -- I expect the overhead from having `parallel` read, split, and pipe the file's contents to `grep` will make it slower than having a single `grep` process read the file directly. – Gordon Davisson Jun 01 '20 at 03:32
  • @GordonDavisson my command is a fork from [this one](https://stackoverflow.com/a/9067042/13641553), it's not entirely my idea. – Just Wanna Learn ASM Jun 01 '20 at 03:35
  • In that one, `parallel` isn't processing the bulk data, it's just handing out filenames to `grep` commands and having them read the files directly. In that situation, parallelizing it does make sense. – Gordon Davisson Jun 01 '20 at 03:37
  • A 74-gigabyte XML file is a creepy combination. A tag opened at the beginning requires reading the whole 74G to find its closing tag; XML is by nature a serial data format, with a nested structure on top of that. I really wonder what motivated choosing XML for such a large dataset that has to be searched like this. – Léa Gris Jun 01 '20 at 03:58
  • In general, disk I/O will be the bottleneck, and that cannot be accelerated with `parallel`. Please consider trying `ripgrep` (`rg`), which has much better performance than `grep`. – tshiono Jun 01 '20 at 04:01

1 Answer


1.) The largest XML file I have is 11G, so YMMV, but using `parallel --pipepart LC_ALL=C grep -H -n 'searchterm' {} :::: file.xml` was faster than `parallel --pipe --block 10M --ungroup LC_ALL=C grep -iF "test.*pattern" < file.xml`, and significantly faster than a plain `grep "searchterm" file.xml`. The three approaches are shown side by side below.
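As a rough, runnable sketch of the comparison (`'searchterm'` and `file.xml` are placeholders; substitute your own pattern and file):

```bash
# Baseline: a single grep process reads the file directly
LC_ALL=C grep -H -n 'searchterm' file.xml

# --pipe: parallel itself reads the file and pipes 10M chunks to the greps
parallel --pipe --block 10M --ungroup LC_ALL=C grep 'searchterm' < file.xml

# --pipepart: each job reads its own portion of the file, avoiding the
# central-reader bottleneck (:::: file.xml names the file to partition)
parallel --pipepart LC_ALL=C grep -H -n 'searchterm' {} :::: file.xml
```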

2.) I didn't specify a block size for the `parallel --pipepart` command above, but you can set one with the `--block` option; you'll need to try different block sizes yourself to see whether they speed up or slow down the search (a timing sketch follows below). Using `--block -1` gave the fastest speed on my system for this approach.
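One simple way to compare block sizes is a timing loop like this sketch (the sizes listed are arbitrary examples; results will depend heavily on your disk and CPU):

```bash
# Try a range of block sizes and time each run; output is discarded so
# printing to the terminal doesn't skew the timings. --block -1 splits the
# file into roughly one evenly sized part per jobslot.
for blk in 10M 20M 50M 100M -1; do
  echo "--block $blk"
  time parallel --pipepart --block "$blk" LC_ALL=C grep 'searchterm' {} :::: file.xml > /dev/null
done
```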

As @tshiono mentioned in the comments, try `ripgrep` - it was the fastest on my test XML file (quicker than `grep`, parallelized `grep`, or anything else) and may prove to be a better solution for you overall.
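A minimal example, assuming `rg` is installed (pattern and file are placeholders):

```bash
# ripgrep: fast single-file search (memory maps, SIMD-accelerated matching)
rg -i 'searchterm' file.xml

# literal (fixed-string) search, the analogue of grep -F
rg -F 'literal text' file.xml
```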

EDIT: I tested @Ole Tange's suggested 'parallel + ripgrep' approach (`parallel --pipepart --block -1 LC_ALL=C rg 'Glu299SerfsTer21' {} :::: ClinVarFullRelease_00-latest.xml`) and it was about the same as `rg 'Glu299SerfsTer21' ClinVarFullRelease_00-latest.xml` on my system. The difference was negligible, so the 'parallel + rg' approach may still be best for a very large XML file. There are a number of potential reasons I didn't see the expected speedup (e.g. @Gordon Davisson's suggestions in his comments above), but you would need to run comprehensive benchmarks on your own system to figure out the best approach; one way to do that is sketched below.
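A benchmarking sketch, assuming the `hyperfine` benchmarking tool is installed (plain `time` works too):

```bash
# Compare single-process rg against parallel + rg on the same file;
# --warmup keeps the page-cache state comparable between runs
hyperfine --warmup 1 \
  "rg 'Glu299SerfsTer21' ClinVarFullRelease_00-latest.xml" \
  "parallel --pipepart --block -1 LC_ALL=C rg 'Glu299SerfsTer21' {} :::: ClinVarFullRelease_00-latest.xml"
```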

(Thanks, Ole Tange, for the suggestion and for creating such kick-ass software.)

jared_mamrot
  • I think the fastest you can go is `--pipepart --block -1` combined with `ripgrep`. `--block -1` splits the file into roughly evenly sized blocks - one per CPU thread. – Ole Tange Jun 22 '20 at 12:26