
I have some 5 million text files under a directory, all in the same format (nothing special, just plain text files with some integers on each line). I would like to compute the maximum and minimum line counts across all these files, along with the two corresponding filenames (one for the max and one for the min).

I started out by trying to write out all the line counts like so (and then work out how to find the min and max from that list):

wc -l `find /some/data/dir/with/text/files/ -type f` > report.txt

but this throws me an error:

bash: /usr/bin/wc: Argument list too long

Perhaps there is a better way to go about this?

Note that using `$(find ...)` or `$(ls ...)` in a command line is a bad practice in general -- see [BashPitfalls #1](http://mywiki.wooledge.org/BashPitfalls#for_f_in_.24.28ls_.2A.mp3.29), and [UsingFind](https://mywiki.wooledge.org/UsingFind) describing what to do instead. – Charles Duffy Sep 28 '20 at 22:40

...btw, given that you have millions of files, I'd consider filtering on byte count before running the line counts; unless you're prone to having wild outliers in terms of line length, it'd be a lot more efficient to only scan the 100,000 largest and smallest files (as measured in bytes, which is a constant-time operation to measure rather than one that scales with size) for length in lines. – Charles Duffy Sep 28 '20 at 22:48
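
A minimal sketch of the byte-count pre-filter suggested in the last comment, assuming GNU find (for -printf) and filenames without whitespace; sizes.txt and the two report filenames are hypothetical:

# List every file with its size in bytes (a stat, which is constant-time,
# unlike reading the file), sorted so the smallest files come first.
find /some/data/dir/with/text/files/ -type f -printf '%s %p\n' | sort -n > sizes.txt

# Run wc -l only on the 100,000 smallest and 100,000 largest files.
head -n 100000 sizes.txt | cut -d' ' -f2- | xargs wc -l > report-small.txt
tail -n 100000 sizes.txt | cut -d' ' -f2- | xargs wc -l > report-large.txt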

1 Answer


There is a limit on the length of the argument list (ARG_MAX in the kernel). Since you are passing several million filenames to wc, the command line certainly exceeds it.
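
For reference, you can inspect the limit on your system with getconf:

getconf ARG_MAX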

Better to invoke the command via find -exec instead:

find /some/data/dir/with/text/files/ -type f -exec wc -l {} + > report.txt

Here, each file found by find is appended to the argument list of the command following -exec, in place of the {} placeholder. When the argument-list limit would be reached, the command is run, and the remaining files are processed by further runs of the command in the same way, until the whole list has been handled.

See the find man page for more details.
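
From there, a minimal awk sketch to pull the minimum and maximum out of report.txt, assuming filenames contain no whitespace; note that wc prints a "total" line for each batch, which must be skipped:

awk '$2 != "total" {
    # Track the smallest and largest counts seen so far, with their files.
    if (!seen || $1+0 < min) { min = $1+0; minfile = $2 }
    if (!seen || $1+0 > max) { max = $1+0; maxfile = $2 }
    seen = 1
} END {
    print "min:", min, minfile
    print "max:", max, maxfile
}' report.txt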


Thanks to Charles Duffy for the improvements to this answer.

I can't speak to the downvotes, but I _would_ strongly suggest changing `-exec ... {} \;` to `-exec ... {} +`; as it is, this is wildly inefficient (starting a new copy of `wc` per file, instead of starting a new one only when the argument list would otherwise get too long). – Charles Duffy Sep 28 '20 at 22:38

BTW, while this takes us out of POSIX-compliance into GNUisms, one might also think of switching from `-exec` to `-execdir` (unless the files are organized in lots of small subdirectories); running `wc` in the individual directories means the directory names don't need to be passed on the command line, increasing the number of individual files that can be passed to each copy of `wc`. – Charles Duffy Sep 28 '20 at 22:44

@CharlesDuffy Is it a GNUism? `find /some/data/dir/with/text/files/ -type f -print0 | xargs -0 wc -l >report.txt` – Léa Gris Sep 28 '20 at 23:08

@LéaGris, [apparently not](https://unix.stackexchange.com/a/320172/138686). (Adopted in [PASC Interpretation 1003.2 #210](https://collaboration.opengroup.org/external/pasc.org/interpretations/unofficial/db/p1003.2/pasc-1003.2-210.html) in 2001.) – Amessihel Sep 28 '20 at 23:18