
Trying to print filename of files that don't have 12 columns.

This works at the command line:

for i in *dim*; do awk -F',' '{if (NR==1 && NF!=12)print FILENAME}' $i; done;

When I try to embed this in subprocess.call in a python script, it doesn't work:

subprocess.call("""for %i in (*dim*.csv) do (awk -F, '{if ("NR==1 && NF!=12"^) {print FILENAME}}' %i)""", shell=True)

The first error I received was "Print is unexpected at this time", so I googled and added ^ within the parentheses. The next error was "unexpected newline or end of string", so I googled again and added the quotes around NR==1 && NF!=12. With the current code it prints many lines for each file, so I suspect something is wrong with the if statement. I've used awk and for loops in this style in subprocess.call before, but not combined and not with an if statement.

weirdan
Sara
  • Do you really only intend to test if the first line (`NR==1`) has other than 12 fields (`NF!=12`), or do you want to scan through the entire file checking the number of fields on every line? – e0k Jan 22 '16 at 23:17
  • start by passing the same command in both cases. At the moment, the commands are different -- there is no reason to expect that they produce the same result. – jfs Jan 23 '16 at 07:33

1 Answer


Multiple input files in AWK

In the string you are passing to subprocess.call(), the if condition is the string "NR==1 && NF!=12" rather than a comparison. AWK treats any non-empty string as true, so the block runs for every line, which is why you get many lines printed per file. It might be easier to simplify the shell command by doing everything in AWK. Your loop runs AWK once for every $i, but since AWK accepts multiple input files, there is no need for the loop.

You might want to scan through the entire files until you find any line that has other than 12 fields, and not only check the first line (NR==1). In this case, the condition would be only NF!=12.

If you want to check only the first line of each file, then NR==1 becomes FNR==1 when using multiple files. NR is the "number of records" (across all input files) and FNR is "file number of records" for the current input file only. These are special built-in variables in AWK.
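To make the difference between the two counters concrete, here is a rough Python sketch of how AWK numbers records across several input files. The function name numbered_records is made up for illustration and is not part of AWK or Python.

```python
def numbered_records(paths):
    """Yield (filename, NR, FNR, line), mimicking awk's record counters.

    Hypothetical helper for illustration only.
    """
    nr = 0                      # like NR: keeps counting across all files
    for path in paths:
        fnr = 0                 # like FNR: restarts at zero for each file
        with open(path) as f:
            for line in f:
                nr += 1
                fnr += 1
                yield path, nr, fnr, line.rstrip('\n')
```

With two files of two lines each, the first line of the second file gets NR 3 but FNR 1, so a test on NR==1 can only ever match in the first file.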

Also, the syntax of AWK allows for the blocks to be executed only if the line matches some condition. Giving no condition (as you did) runs the block for every line. For example, to scan through all files given to AWK and print the name of a file with other than 12 fields on the first line, try:

    awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv

I have added the .csv to your wildcard *dim* as you had in the Python version. The -F, changes the field separator from the default (runs of whitespace) to a comma. On the first line of each file (FNR==1), AWK checks whether the number of fields NF is 12. If it is not, it executes the block; every other line fails the FNR==1 test and is skipped. The block prints the FILENAME of the current file AWK is processing, then skips to the beginning of the next file with nextfile. Not every AWK implementation supports nextfile; since only the first line of a file can match here, you can drop it without changing the output.

Try running this AWK version with your subprocess module in Python:

    subprocess.call("""awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv""", shell=True)

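The triple quotes let the string contain both single and double quotes without escaping. With shell=True the command string is handed to the system shell (/bin/sh on Unix, cmd.exe on Windows), so the quoting and the *dim*.csv wildcard are interpreted by that shell. Note that cmd.exe does not treat single quotes as quoting characters and does not expand wildcards itself, so the same string can behave differently on Windows. The output of AWK goes to stdout and I'm assuming you know how to use this in Python with the subprocess module.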
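One possible way to sidestep shell quoting is to pass AWK its arguments as a list and let Python expand the wildcard. This is only a sketch: it assumes awk is on the PATH and Python 3.5 or later, and the helper names build_awk_command and files_with_bad_header are invented for this example.

```python
import glob
import subprocess

def build_awk_command(paths):
    """Return the argv list for awk; no shell is involved, so no escaping."""
    program = 'FNR==1 && NF!=12 {print FILENAME; nextfile}'
    return ['awk', '-F,', program] + list(paths)

def files_with_bad_header(pattern='*dim*.csv'):
    """Return names of files whose first line does not have 12 fields."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        # With no file arguments awk reads standard input and appears to
        # hang, so return early instead of calling it.
        return []
    result = subprocess.run(
        build_awk_command(paths),
        stdout=subprocess.PIPE,
        universal_newlines=True,   # decode the output as text
        check=True,                # raise if awk exits with an error
    )
    return result.stdout.splitlines()
```

Calling files_with_bad_header() in the directory that holds the CSV files should give the same names as the AWK one-liner, without depending on which shell shell=True would have started.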

Using only Python

Don't forget that Python is itself an expressive and powerful language. If you are already using Python, it may be simpler, easier, and more portable to use only Python instead of a mixture of Python, bash, and AWK.

To find the files matching *dim*.csv whose first line has other than 12 comma-separated fields:

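import glob

files_found = []
for filename in glob.glob('*dim*.csv'):
    # The with block closes the file automatically; no f.close() is needed.
    with open(filename, 'r') as f:
        firstline = f.readline()
    # Count the comma-separated fields on the first line only.
    if len(firstline.split(',')) != 12:
        files_found.append(filename)

print(files_found)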

The glob module gives the listing of files matching the wildcard pattern *dim*.csv. The first line of each of these files is read and split into fields separated by commas. If the number of these fields is not 12, the filename is added to the list files_found.
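If the header line might contain quoted fields with embedded commas, a plain split(',') (like awk -F,) would over-count. A hedged variant using the standard csv module is sketched below; the function names first_row_field_count and files_without_n_columns are hypothetical, and the default of 12 columns is taken from the question.

```python
import csv
import glob

def first_row_field_count(path):
    """Return the number of fields in the first CSV row (0 for an empty file)."""
    with open(path, newline='') as f:
        row = next(csv.reader(f), None)   # None when the file has no rows
    return 0 if row is None else len(row)

def files_without_n_columns(pattern='*dim*.csv', expected=12):
    """Return matching files whose first row does not have `expected` fields."""
    return [p for p in sorted(glob.glob(pattern))
            if first_row_field_count(p) != expected]
```

Here a first line such as a,"b,c",d counts as three fields rather than four. Unlike the AWK command, this sketch also reports empty files, since they have zero fields.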

e0k
  • Thanks for the reply/help. I definitely only want to check the first line. I tried awk -F, 'NR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv but that did not yield the desired result (didn't print out any filenames) unlike when I for loop (which works). When I tried your subprocess.call code the command line hangs. – Sara Jan 25 '16 at 15:07
  • For multiple input files, you need `FNR==1` instead of `NR==1`. It could be hanging because of pipe issues. See also [this question](http://stackoverflow.com/questions/13332268/python-subprocess-command-with-pipe) on how to set up the pipes. Using `subprocess` and [especially with `shell=True`](https://docs.python.org/2/library/subprocess.html#frequently-used-arguments) opens a can of worms you really don't have to open. (See edits above.) – e0k Jan 25 '16 at 17:19
  • This helped me with a similar problem I had in a FOR loop in DOS. Using AWK's if() worked fine until I put it inside a DOS FOR loop. The parentheses were creating parsing complexity that was eliminated by getting rid of the if() statement and simply using this to print all lines with "error" in the 3rd field: awk -F, '$3 == "error"{print $2}' – Bob Oct 09 '19 at 17:05