1

I am trying to count the number of files and lines of every .xls in directory where I executed the script.

total_files=0
total_lines=0

find . -type f -name "*.xls" | while read FILE; do

 count=$(grep -c ^ < "$FILE") #get number of lines in particular file
 total_lines=$(($total_lines+$count));
 ((total_files++))

done

echo "Total files: $total_files"
echo "Total lines: $total_lines"

But I just get 0 in totals every time.

I know that it is possibly because of while loop is executed in a sub shell, but using <<< to execute in the current shell wont help:

done <<< "$(find . -type f -name "*.xls")"
干猕猴桃
  • 175
  • 3
  • 14
  • 3
    `*.xls`, is that an Excel file? What do you mean by the number of lines in an Excel file? – Dominique Jul 12 '21 at 11:49
  • Put a shebang on your script paste your script at https://shellcheck.net for validation/suggestions – Jetchisel Jul 12 '21 at 12:02
  • 2
    Run `file foo.xls` on one of your files. If the output doesn't tell you it's a text file then don't bother trying to run text processing tools on it. Similarly look at the output of `head foo.xls` and decide if that looks like plain text to you. – Ed Morton Jul 12 '21 at 12:02
  • @Dominique Yes, Exel file. I mean number of rows* – 干猕猴桃 Jul 12 '21 at 12:12
  • 2
    An Excel file isn't a POSIX text file so you can't run a tool like grep that runs on text files on it and expect to get meaningful output. You'll need to find something that understands Excel files to tell you how many rows it contains or export all of your Excel files to CSV and then run Unix tools on those. See https://stackoverflow.com/q/38805123/1745001 and/or https://unix.stackexchange.com/q/23726/133219 for possible tools/approaches. – Ed Morton Jul 12 '21 at 12:18
  • @EdMorton Thanks for advice. But `count=$(grep -c ^ < "$FILE") ` works for me, or is it output not reliable? – 干猕猴桃 Jul 12 '21 at 12:20
  • 1
    Yes, its output is not reliable, it's shocking if you got correct output from that at any time. I just created a XLS with 3 rows and `grep -c ^ < "$D/tst.xls"` tells me it contains 11,874 rows. – Ed Morton Jul 12 '21 at 12:24
  • 1
    @EdMorton Wow, thanks, that's really helpful – 干猕猴桃 Jul 12 '21 at 12:27
  • 'Number of lines' is not really a concept for an Excel file. What if the file has multiple sheets? – dawg Jul 12 '21 at 12:48
  • @dawg It should have only 1 sheet each in my case. But yes, its a good point to check – 干猕猴桃 Jul 12 '21 at 13:04

4 Answers4

1

Unix line count methods strictly work on TEXT files, including (mostly) csv files.

Excel files are not text. Excel files are either .xls type binary file used up to 2007 or the .xlsx open xml style file used since. Neither are line oriented.

The only reliable way to do this is use a scripting language that can parse Excel files and can walk a file tree.

Python, Ruby, Perl can all do that on Unix.

Here is a Python with Pandas to walk a file tree, read the file, and count the rows of the default sheet of that file:

# Note: You may need to use pip to install
#       pandas
#       xldd
#       openpyxl

import pandas as pd
from pathlib import Path 

p=Path('ur_root_path')

file_count=0
line_count=0
for pn in p.glob('**/*.xls*'):
    try:
        df=pd.read_excel(pn,header=None)
        print(f'{pn}\n{df}')
        file_count+=1
        line_count+=df.shape[0]
    except ValueError as e:
        print(f'{pn}: {e}')
    
print(f'files={file_count}, total lines={line_count}')  

Given this folder:

$ ls -1 /tmp/test
Book1.xlsx          # 3 rows in col 'A' -- 1-3
Book2.xls           # 6 rows in col 'C' -- 1-6

Running that script produces:

/tmp/test/Book2.xls
    0   1  2
0 NaN NaN  1
1 NaN NaN  2
2 NaN NaN  3
3 NaN NaN  4
4 NaN NaN  5
5 NaN NaN  6
/tmp/test/Book1.xlsx
   0
0  1
1  2
2  3
files=2, total lines=9
dawg
  • 98,345
  • 23
  • 131
  • 206
0

Works just fine:

#!/usr/bin/env bash

total_files=0
total_lines=0

while IFS= read -r file; do
  count="$(wc -l < "$file")"
  (( total_lines+=count ))
  (( total_files++ ))
  echo "$count"
done <<< "$(find . -type f -name '*.xls')"
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • 2
    To be clear, though that would work for a POSIX text file, it won't work for a `.xls` file since that isn't a POSIX text file. See the comments under the question. – Ed Morton Jul 12 '21 at 13:10
0

awk is a better tool for this task because it can do both counts natively in one run:

find . -type f -iname '*.xls' -exec awk \
 'END{printf("Number of files: %d\nTotal number of lines: %d\n", ARGC, NR)}' {} \+
Léa Gris
  • 17,497
  • 4
  • 32
  • 41
  • 1
    That'd still fail since the input isn't a text file, you can't rely on `+` giving `awk` ALL of the files at once, unlike `;`, you don't need to escape the `+`, and the number of input files is `ARGC-1` since the first arg is the name of the interpreter. – Ed Morton Jul 12 '21 at 13:04
0

This might work for you:

find . -type f -name "*xls"  2>/dev/null|xargs -n1 wc |sed 's/^/1 /'|numsum -cx1,2

Find files, process each file with wc, prepend a file counter column using sed and tally the first two columns using numsum from num-utils.

The output format is total: files and lines, words and characters may be added by changing the -x command line option for numsum to -x1,2,3,4.

potong
  • 55,640
  • 6
  • 51
  • 83