awk to accumulate line counts for files with spaces in their names

Question

I use this find/wc/awk pipeline to list the line counts of my code files, with a final sum:

$ cat Makefile 
qa:
    @find ./ -type f -name '*.py' -exec  \
        wc -l "{}" \; | sort -n| awk     \
        '{printf "%4s %s\n", $$1, $$2}{s+=$$0}END{print s}'
    @echo ''

$

If there is no space in the filenames it works well:

 $ make qa | tail
 545 ./vendored/version.py
 550 ./types.py
 567 ./interchange/from_dataframe.py
 702 ./compute.py
 716 ./vendored/docscrape.py
1003 ./dataset.py
1267 ./pandas_compat.py
3686 ./parquet/core.py
14347

If one of the files has a space in its name, it no longer works:

$ mv parquet/core.py "parquet/co  re.py"
$ ls -la parquet/co*py
-rw-rw-r-- 1 luis luis 139281 juil. 19 20:30 'parquet/co  re.py'
(data) luis@spinoza:/tmp/pyarrow$ make qa | tail
 545 ./vendored/version.py
 550 ./types.py
 567 ./interchange/from_dataframe.py
 702 ./compute.py
 716 ./vendored/docscrape.py
1003 ./dataset.py
1267 ./pandas_compat.py
3686 ./parquet/co                   <== Problem!
14347
$

I tried protecting $$1 with double quotes, e.g. "$$1", with no success.

it would help if you provided the complete set of input being fed to `awk` so that we can run/verify an `awk` solution in our own env (and generate the same output you're seeing) — markp-fuso, Jul 19 '23 at 18:53
FWIW, `awk` isn't needed here. `wc` can do this on its own: `find ./ -type f -name '*.py' -exec wc -l {} + ` — Brian61354270, Jul 19 '23 at 18:56
You can't reliably use `wc` like that since `find` isn't guaranteed to only call it once for all files together, it might call `wc` multiple times for different groups of files it finds. — Ed Morton, Jul 20 '23 at 13:13

score 3 · Answer 1 · answered Jul 19 '23 at 18:57

Change this:

printf "%4s %s\n", $$1, $$2

to this:

size=$$1; name=$$0; sub(/^[[:space:]]*[^[:space:]]+[[:space:]]+/,"",name); printf "%4s %s\n", size, name

That'll work as long as your file names don't contain newlines. If they can, you need a different solution, starting with the find command.
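As a rough illustration of the change above, here is a minimal sketch that runs the same awk logic outside a Makefile, so each `$` is written once instead of `$$`. The function name `format_counts` and the two sample input lines are made up for the demo. The sum is accumulated from the saved count rather than from `$0`, which should be equivalent here.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): reformat `wc -l` style lines
# ("<count> <name>") and print the sum of the counts on a final line.
format_counts() {
    awk '{
        size = $1      # first field is the line count
        name = $0      # start from the whole record
        # strip optional leading blanks, the count, and the blanks after it,
        # leaving the file name with its internal spaces intact
        sub(/^[[:space:]]*[^[:space:]]+[[:space:]]+/, "", name)
        printf "%4s %s\n", size, name
        s += size
    }
    END { print s }'
}

# Sample input in the shape `wc -l` produces (assumed values for the demo).
printf '%s\n' ' 702 ./compute.py' '3686 ./parquet/co  re.py' | format_counts
```

Because the name is taken from `$0` rather than from `$2`, the two spaces inside `co  re.py` survive unchanged.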

Brian61354270 · Answer 2 · 2023-07-19T19:26:55.373

3

Provided that the number of files is only around one hundred thousand^*, there's no need to use awk. wc can compute the sum on its own:

find ./ -type f -name '*.py' -print0 | xargs -0 wc -l

which will output something like

  ...
  545 ./vendored/version.py
  550 ./types.py
  567 ./interchange/from_dataframe.py
  702 ./compute.py
  716 ./vendored/docscrape.py
 1003 ./dataset.py
 1267 ./pandas_compat.py
 3686 ./parquet/core.py
14347 total

^*The number of files must be less than the maximum number of command line arguments supported in your environment. Check xargs -r --show-limits to see what the limits on your system are.

edited Jul 19 '23 at 19:26

answered Jul 19 '23 at 19:02

Brian61354270


`find` would call `wc` passing it batches of file names, not necessarily all of the file names at once, and so YMMV with the output - it could have a bunch of partial sums and no final total. As [the man page](https://man7.org/linux/man-pages/man1/find.1.html) says about invocations of `wc` or any other tool in that context - "the total number of invocations of the command will be much less than the number of matched files." - it doesn't say the total number of invocations will be 1. – Ed Morton Jul 19 '23 at 19:03
@EdMorton Good point. I've edited the answer to use `xargs`, which at least states that it will build invocations up to the system limit, and added a comment on how to check the system limits. – Brian61354270 Jul 19 '23 at 19:23

find and xargs work the same way and so have the same caveat. It's not the number of arguments that's the issue, it's the length of the concatenation of the arguments. So they might call `wc` twice given just 2 long file names if they combined exceed ARG_MAX. Personally, I just would never rely on either of them only calling the command once for a task like this. – Ed Morton Jul 19 '23 at 19:33
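One way to avoid depending on a single `wc` invocation, sketched here under some assumptions, is to let awk drop every per-batch "total" line and compute the sum itself. The function name `sum_batches` and the scratch files are hypothetical. The filter relies on `find .` printing every path with a leading `./`, so a record whose only text after the count is the bare word `total` can only come from `wc`. Like the answers above, it assumes file names contain no newlines.

```shell
#!/bin/sh
# Hypothetical helper: sum `wc -l` output that may arrive in several batches.
sum_batches() {
    awk '
    # Skip the per-batch summary lines that wc prints ("<count> total").
    NF == 2 && $2 == "total" { next }
    # Pass every per-file line through unchanged and accumulate its count.
    { s += $1; print }
    END { printf "%d total\n", s }'
}

# Demo in a scratch directory. `xargs -n 2` forces small batches so that wc
# runs more than once, imitating what happens when ARG_MAX is exceeded.
(
    tmp=$(mktemp -d) || exit 1
    cd "$tmp" || exit 1
    printf 'a\nb\n'    > one.py
    printf 'c\n'       > 'two  words.py'
    printf 'd\ne\nf\n' > three.py
    find . -type f -name '*.py' -print0 | xargs -0 -n 2 wc -l | sum_batches
    cd / && rm -rf "$tmp"
)
```

The per-file lines come out in whatever order `find` yields them. A `sort -n` placed before `sum_batches`, as in the question's pipeline, would order them by count.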
