Multiple input files in AWK
In the string you are passing to subprocess.call(), your if statement is evaluating a string (probably not the comparison you want). It might be easier to simplify the shell command by doing everything in AWK. You are executing AWK once for every $i in the shell's for loop. Since AWK accepts multiple input files, there is no need for this loop.
You might want to scan through the entire files until you find any line that has other than 12 fields, rather than checking only the first line (NR==1). In this case, the condition would be only NF!=12.
If you want to check only the first line of each file, then NR==1 becomes FNR==1 when using multiple files. NR is the number of records read so far across all input files, and FNR is the number of records read so far from the current input file only, so it restarts at 1 for each file. These are special built-in variables in AWK.
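To make the difference between the two counters concrete, here is a small Python sketch that reproduces them with the standard fileinput module, whose lineno() and filelineno() methods behave like NR and FNR respectively. The function name number_records is made up for this illustration and is not part of AWK or of the question's code.

```python
import fileinput


def number_records(paths):
    """Return (filename, NR, FNR) for every line read from the given files.

    NR counts lines across all files; FNR restarts at 1 for each file.
    Note: fileinput falls back to stdin when `paths` is empty.
    """
    rows = []
    with fileinput.input(files=paths) as stream:
        for _line in stream:
            rows.append((stream.filename(), stream.lineno(), stream.filelineno()))
    return rows
```

With two files of two and three lines, the second file's first line gets NR equal to 3 but FNR equal to 1, which is why FNR==1 is the right test for "first line of each file".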
Also, AWK's syntax lets a block run only when the line matches some condition. Giving no condition (as you did) runs the block for every line. For example, to scan through all files given to AWK and print the name of every file whose first line has other than 12 fields, try:
awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv
I have added the .csv to your wildcard *dim* as you had in the Python version. The -F, option changes the field separator from the default (whitespace) to a comma. On the first line of each file (FNR==1), AWK checks whether the number of fields NF is 12. If it is not, AWK executes the block; otherwise it prints nothing for that file. The block prints the FILENAME of the file AWK is currently processing, then skips to the beginning of the next file with nextfile.
Try running this AWK version with the subprocess module in Python:
subprocess.call("""awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv""", shell=True)
The triple quotes let the string contain the single quotes around the AWK program (and any double quotes) without escaping. The output of AWK goes to stdout, and I'm assuming you know how to use this in Python with the subprocess module.
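If you would rather have the file names back in a Python list than on stdout, one possible approach is sketched below. It assumes Python 3.7 or later (for capture_output) and an awk on the PATH that supports nextfile. The names build_awk_command and files_with_bad_header are made up for this example. The wildcard is expanded with glob instead of the shell, so shell=True is not needed.

```python
import glob
import subprocess

# Same AWK program as above: report files whose first line lacks 12 fields.
AWK_PROGRAM = "FNR==1 && NF!=12 {print FILENAME; nextfile}"


def build_awk_command(filenames, program=AWK_PROGRAM):
    """Build the argument list for awk; no shell quoting is required."""
    return ["awk", "-F,", program, *filenames]


def files_with_bad_header(pattern="*dim*.csv"):
    """Run awk over the matching files and return the names it prints."""
    filenames = sorted(glob.glob(pattern))
    if not filenames:
        # With no file arguments awk would read stdin, so return early.
        return []
    result = subprocess.run(
        build_awk_command(filenames),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.splitlines()
```

Calling files_with_bad_header() in the directory holding the CSV files should give the same names the shell command prints, one list entry per file.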
Using only Python
Don't forget that Python is itself an expressive and powerful language. If you are already using Python, it may be simpler, easier, and more portable to use only Python instead of a mixture of Python, bash, and AWK.
You can find the names of the files (selected from *dim*.csv) whose first line has other than 12 comma-separated fields with:
import glob

files_found = []
for filename in glob.glob('*dim*.csv'):
    # The with statement closes the file automatically.
    with open(filename, 'r') as f:
        firstline = f.readline()
    # Count the comma-separated fields on the first line only.
    if len(firstline.split(',')) != 12:
        files_found.append(filename)
print(files_found)
The glob module lists the files matching the wildcard pattern *dim*.csv. The first line of each of these files is read and split into fields separated by commas. If the number of these fields is not 12, the filename is added to the list files_found.
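For the other case mentioned earlier, where any line of a file (not just the first) may have the wrong number of fields, a pure-Python counterpart of the NF!=12 condition could look like the sketch below. The function names has_bad_line and find_files_with_bad_lines are invented for this example, and the plain split(',') assumes no field contains a quoted comma.

```python
import glob


def has_bad_line(filename, expected_fields=12):
    """Return True as soon as one line has a different number of fields."""
    with open(filename, "r") as f:
        # any() stops at the first offending line, much like nextfile in AWK.
        return any(
            len(line.rstrip("\n").split(",")) != expected_fields
            for line in f
        )


def find_files_with_bad_lines(pattern="*dim*.csv", expected_fields=12):
    """List the files matching the pattern that contain an offending line."""
    return [
        name
        for name in sorted(glob.glob(pattern))
        if has_bad_line(name, expected_fields)
    ]
```

An empty file has no lines to fail the check, so it is not reported, which matches what the AWK condition would do.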