Count lines of code recursively, including compressed (zip) files

Question

I use the following Bash script to count lines of code in one of my projects:

echo "--- CLIENT"
cd "/mypath/client"

# Count classes:
a=`find . -name \*.java -print | wc -l`
echo ""
echo "Number of Java classes: $a"

# Total count:
b=`find . -name \*.java -exec cat {} \; | wc -l`
echo ""
echo "Java lines: $b"

c=`find . -name \*.css -exec cat {} \; | wc -l`
echo ""
echo "CSS lines: $c"

d=`find . -name \*.json -exec cat {} \; | wc -l`
echo ""
echo "JSON lines: $d"

f=$((`find . -name \*.h -exec cat {} \; | wc -l` + `find . -name \*.m -exec cat {} \; | wc -l`))
echo ""
echo "iOS Objective-C lines: $f"

echo ""
echo "--- SERVER"
cd "/mypath/server"
# Count classes:
h=`find . -name \*.java -print | wc -l`
echo ""
echo "Number of Java classes: $h"

# Total count:
i=`find . -name \*.java -exec cat {} \; | wc -l`
echo ""
echo "Java lines: $i"


echo ""
echo "Total lines of code: $((b + c + d + f + i))"

cd ~

This script worked fine as long as all the source code was searchable this way. Now I have a different use case: some of the source code is still reachable with this script, and some of it is inside compressed zip files (located in various subfolders of "/mypath/client"). These zip files can contain the sources in the root or in various subfolders within them.

I suppose it's possible to adapt my script to take into account the zipped files in the count, but I don't know how to do it.

To simplify, it would be enough for me to have an answer that only shows how to modify the line "a=`find . -name \*.java -print | wc -l`"; everything else would follow from that. — Francesco Galgani, May 25 '21 at 08:25
You can add a zip-specific section to your script. Something like `j=$(find . -name \*.zip -exec unzip -l {} \; | grep '\.java$' | wc -l)`. — Zilog80, May 25 '21 at 09:47

Socowi · Accepted Answer · 2021-05-26T16:57:42.247


Counting Files

When you search for .xyz files, also search for .zip files and go through their file lists. You can list all filenames in a zip archive, one per line, using zipinfo -1 archive.zip. zipinfo also supports wildcards to print only matching filenames. For instance, zipinfo -1 archive.zip '*.java' prints only filenames ending with .java.

find . -name \*.java -print \
    -o -name \*.zip -exec zipinfo -1 {} '*.java' \; |
wc -l

This command assumes that filenames do not contain linebreaks.

Counting Lines

You can print zipped files without explicitly extracting them using unzip -p archive.zip file1 file2 .... This command also accepts wildcards.

By the way: You can drastically simplify your script by using a function, since find . -name \*.xyz -exec cat {} \; | wc -l is the same every time, except for xyz. Also, -exec cat {} + is much faster than -exec cat {} \; because it passes many files to a single cat invocation instead of starting one cat process per file.

#! /bin/bash

countLines() {
  local ext=$1
  find . -name "*.$ext" -exec cat {} + \
      -o -name \*.zip -exec unzip -p {} "*.$ext" \; |
  wc -l
}

for ext in java css json; do
  echo "$ext lines: $(countLines "$ext")"
done

unzip -p archive.zip '*.java' may print the warning caution: filename not matched: *.java if the archive contains no .java files. You can suppress it by appending 2> /dev/null to the find command, that is, before the pipe into wc -l.
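
As a minimal sketch of where that redirection goes, here is the countLines function from above with stderr discarded. Note that this silences every message written by find and by the cat and unzip processes it starts, not only the unzip warning.

```bash
#!/bin/bash

# Same countLines as above, with stderr of find (and of the cat/unzip
# processes it starts) discarded. This hides the "filename not matched"
# caution, but also any other error message, e.g. "Permission denied".
countLines() {
  local ext=$1
  find . -name "*.$ext" -exec cat {} + \
      -o -name \*.zip -exec unzip -p {} "*.$ext" \; 2> /dev/null |
  wc -l
}
```

If you would rather keep real errors visible, filtering only the known warning out of stderr is an alternative, at the cost of a more complicated pipeline.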

Keep in mind that this approach is very inefficient: find has to run once for each file extension, and every zip file is read once per extension too. It would be faster to first select all the files you want to inspect, then run wc -l on all of them, and finally sum up their line counts.
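
A rough sketch of that single-pass idea, for plain (non-zipped) files only, could look like the following. The function name countLinesOnce is made up for this illustration, it assumes filenames contain no linebreaks, and zip archives would still need separate handling (for example with unzip -p as shown above).

```bash
#!/bin/bash

# Hypothetical helper (not part of the answer above): walk the tree once,
# let wc -l count every matching plain file, and sum the counts per
# extension in awk. Zip archives are NOT handled here.
# Assumes filenames contain no newlines.
countLinesOnce() {
  local dir=$1; shift
  [ "$#" -gt 0 ] || return 1
  local ext tests=()
  for ext in "$@"; do
    tests+=(-o -name "*.$ext")   # builds: -o -name *.java -o -name *.css ...
  done
  # "${tests[@]:1}" drops the leading -o. LC_ALL=C keeps wc's summary
  # line spelled "total" regardless of the user's locale.
  LC_ALL=C find "$dir" -type f \( "${tests[@]:1}" \) -exec wc -l {} + |
  awk -v exts="$*" '
    NF == 2 && $2 == "total" { next }      # skip the summary line(s) of wc
    { ext = $0; sub(/.*\./, "", ext)       # text after the last dot
      sum[ext] += $1 }
    END {
      n = split(exts, order, " ")
      for (i = 1; i <= n; i++)
        printf "%s lines: %d\n", order[i], sum[order[i]]
    }'
}
```

Called as countLinesOnce . java css json, this prints one "ext lines: N" line per extension, in the order given, after a single traversal of the directory tree.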

edited May 26 '21 at 16:57

answered May 25 '21 at 09:54


Thanks! Before I accept your answer, I need to do some experimentation and study it, because it seems to give me more numerical results than I expected. My knowledge of Bash is limited to the essentials, I need some time to understand what you wrote. – Francesco Galgani May 25 '21 at 10:13
I have thoroughly and extensively tried your solution. On simple cases, it works. On very complex projects, something is wrong, sometimes the numbers reported are much higher than the real ones, sometimes tens of thousands of lines of code are reported even when there is no code or only a few lines. I have no idea why this happens, something is buggy, but I don't know what. – Francesco Galgani May 26 '21 at 16:02
Solved with a workaround: I unzip everything in a temporary folder with the command indicated in https://stackoverflow.com/a/22384233 and apply the same calculation script indicated in my question. In this way the count is correct, equal to what I expected. – Francesco Galgani May 26 '21 at 16:23
@FrancescoGalgani Thank you for testing. I think I found the problem. `unzip -p` accepts wildcards. Could it be that in the problematic projects one of your zipped source files had a name containing one of `*?[]`? If so, that might have caused more files to be printed. Funnily enough, the fix made the whole program simpler. – Socowi May 26 '21 at 16:44
I have accepted your answer: after your fix, the calculation is now correct even on complex projects. Thank you. – Francesco Galgani May 27 '21 at 11:13
