I am trying to count the lines in all the files in a very large folder under Ubuntu.

The files are .gz files and I use

zcat * | wc -l

to count all the lines in all the files, and it's slow!

I want to use multiple cores for this task and found GNU Parallel.

I tried to use this bash command:

parallel zcat * | parallel --pipe wc -l

but the cores are not all working. I found that job startup might cause major overhead, so I tried batching with

parallel -X zcat * | parallel --pipe -X wc -l

without improvement.

How can I use all the cores to count the lines in all the files in a folder, given that they are all .gz files and need to be decompressed before counting the rows? (I don't need to keep them uncompressed afterwards.)

Thanks!

thebeancounter
  • I don't know much about parallel, though from this link, https://drjohnstechtalk.com/blog/2011/06/gnu-parallel-really-helps-with-zcat/, you should use a command more like `ls *gz|time parallel -k "zcat {}" | wc -l` – Adonis Jun 21 '17 at 11:49
  • thanks. seems to utilize the cores better, need to check the result, could you explain a bit more about the syntax and create an answer? – thebeancounter Jun 21 '17 at 11:54
  • Basically `parallel` expects an input, so you list files, pipe it to parallel, ask parallel to keep the order, which might not be necessary in this case (`-k`), use parallel with zcat on one line of the list `"zcat {}"` and then pipe the whole to `wc`. By the way there is a typo in the command above, the `time` command should be removed. I will write a more explicit answer tonight (after some testing on my side) if no one beats me to it! – Adonis Jun 21 '17 at 11:58
  • Thanks! waiting for it. – thebeancounter Jun 21 '17 at 12:00
  • Finished testing it now. It gave the total number of lines across all the files, which is not what I needed; I needed a list of the number of lines in each file. I obtained it based on your answer, using `ls dest/* |time parallel -k "zcat {} | wc -l"` – thebeancounter Jun 21 '17 at 13:11
  • Check the post by Nicholas Sushkin here: https://stackoverflow.com/questions/12716570/count-lines-in-large-files – papkass Jun 21 '17 at 13:21
  • What do you mean by a *"large folder"*? Do you mean there are millions of files? Or that there are a few files each many hundred GB? – Mark Setchell Jun 21 '17 at 15:12
  • Around 150k files that are long (8mb when .gz and around 150mb when decompressed). Part of the issue is that the list is too long for ls. – thebeancounter Jun 21 '17 at 15:14
  • Is this on a normal, single spinning disk, or some type of SSD that can sustain multiple readers? I presume your files are 8MB (8 megabytes) rather than 8mb (8 millibits). – Mark Setchell Jun 21 '17 at 20:01

2 Answers

If you have 150,000 files, you will likely get problems with "argument list too long". You can avoid that like this:

find . -maxdepth 1 -name '*gz' -print0 | parallel -0 ...
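As a minimal sketch of how the elided part could be filled in when only the grand total is wanted: each job prints one number and awk adds them up. The function name count_total_lines is hypothetical, and the sketch assumes GNU Parallel, GNU find and zcat are installed.

```shell
# count_total_lines is a hypothetical helper name, not part of GNU Parallel.
# Usage: count_total_lines [DIR]   (DIR defaults to the current directory)
count_total_lines() {
  dir=${1:-.}
  # find emits NUL-terminated names, so the shell never builds a long argument list.
  find "$dir" -maxdepth 1 -name '*.gz' -print0 |
    # One job per file; each job prints that file's line count.
    parallel -0 'zcat {} | wc -l' |
    # Sum the per-file counts; "+ 0" prints 0 when no file matched.
    awk '{ total += $1 } END { print total + 0 }'
}
```

Calling count_total_lines with the folder as its argument prints a single number, the same figure zcat * | wc -l would give.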

If you want the name beside the line count, you will have to echo it yourself, since your wc process will only be reading from its stdin and won't know the filename:

find ... | parallel -0 'echo {} $(zcat {} | wc -l)'

Next, we come to efficiency, which will depend on what your disks are capable of. Try parallel -j2, then parallel -j4, and see what works best on your system.
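One rough way to compare job counts is a small timing loop. This is only a sketch: time_jobs is a hypothetical helper, it measures whole seconds with date +%s, and later runs may look faster simply because the files are already in the page cache.

```shell
# time_jobs is a hypothetical helper name.
# Usage: time_jobs DIR JOBS...   e.g. time_jobs . 2 4 8
time_jobs() {
  dir=$1
  shift
  for jobs in "$@"; do
    start=$(date +%s)
    # Same pipeline each time; only the number of simultaneous jobs changes.
    find "$dir" -maxdepth 1 -name '*.gz' -print0 |
      parallel -0 -j "$jobs" 'zcat {} | wc -l' > /dev/null
    end=$(date +%s)
    echo "jobs=$jobs seconds=$((end - start))"
  done
}
```

Running it on a representative subset of the files, rather than all 150,000, is probably enough to see where adding jobs stops helping.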


As Ole helpfully points out in the comments, you can avoid having to output the name of the file whose lines are being counted by using GNU Parallel's --tag option to tag each output line with its argument, so this is even more efficient:

find ... | parallel -0 --tag 'zcat {} | wc -l'
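If both the per-file list and a grand total are wanted, the tagged output can be passed through awk. A sketch, assuming that --tag separates the file name from the job's output with a TAB and that no file name contains a TAB itself; count_lines_per_file is a hypothetical name.

```shell
# count_lines_per_file is a hypothetical helper name.
# Usage: count_lines_per_file [DIR]
count_lines_per_file() {
  dir=${1:-.}
  find "$dir" -maxdepth 1 -name '*.gz' -print0 |
    # --tag prefixes each output line with the file name and a TAB.
    parallel -0 --tag 'zcat {} | wc -l' |
    # Echo every "name<TAB>count" line and add a final "total<TAB>sum" line.
    awk -F '\t' '{ print; total += $NF } END { print "total\t" total + 0 }'
}
```

The lines appear in the order the jobs finish; adding -k to parallel would keep them in input order.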
Mark Setchell

Basically the command you are looking for is:

ls *gz | parallel 'zcat {} | wc -l'

What it does is:

  • ls *gz lists all gz files on stdout
  • Pipe it to parallel
  • Spawn subshells with parallel
  • Run in said subshells the command inside quotes 'zcat {} | wc -l'

About the '{}', according to the manual:

This replacement string will be replaced by a full line read from the input source

So each line piped to parallel gets fed to zcat.
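To see the substitution without decompressing anything, GNU Parallel's --dry-run option prints the commands it would run instead of running them. A small sketch; preview_jobs is a hypothetical name and a.gz and b.gz are made-up file names that do not need to exist.

```shell
# preview_jobs is a hypothetical helper name.
# It reads file names on stdin and prints the command built for each one.
preview_jobs() {
  # -k keeps the output in input order; --dry-run prints instead of executing.
  parallel -k --dry-run 'zcat {} | wc -l'
}

# Example call (the names are placeholders):
#   printf '%s\n' a.gz b.gz | preview_jobs
```

Each printed command is the quoted template with {} replaced by one input line.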

Of course, this is basic and I assume it could be tuned; the documentation and examples might help.

Adonis