
I am using the following Awk command which is based on this Stack Exchange post:

tail -n +2 *.csv | sort -t',' -k2 | awk -F',' '$2~/^[[:space:]]*$/{next} {sub(/\r$/,"")} $2!=prev{close(out); out=$2".txt"; prev=$2} {print $1 > out}'

The command works perfectly under MacOS 10.14. However, I recently upgraded to MacOS 12.6 and it no longer works. (MacOS 12.6 uses awk version 20200816).

It produces the following error:

awk: newline in regular expression ... at source line 1
 context is
    $2~/^[[:space:]]*$/{next} {sub(/ >>> 
 <<< 
awk: syntax error at source line 1
awk: illegal statement at source line 1

How can I get it working again and, ideally, make it more future-proof without having to install any extra software? I looked at the changes made to awk but can't find anything that would cause it to stop working.


Background

The command takes all CSV files in a directory and splits their rows into text files named after the values in the second column, keeping only the values from the first column. Rows with an empty second column are skipped.

Example CSV file:

COLUMN 1,COLUMN 2
innovation "is essential",3-Entrepreneurship
countless,
innocent,2-Police
toilet handle,2-Bathroom
née dresses,3-Companies
odorless,2-Sense of Smell
old ideas "new takes",3-Entrepreneurship
new income streams,3-Entrepreneurship
Zoë’s food store,3-Companies
many,
crime "doesn't sleep",2-Police
bath room,2-Bathroom
ring,
móvíl résumés,3-Companies
musty smell's come here,2-Sense of Smell
good publicity guru,3-Entrepreneurship
Señor,3-Companies

Expected output after the split:

In file 3-Entrepreneurship.txt

innovation "is essential"
old ideas "new takes"
new income streams
good publicity guru

In file 2-Bathroom.txt

toilet handle
bath room

In file 2-Police.txt

innocent
crime "doesn't sleep"

In file 2-Sense of Smell.txt

odorless
musty smell's come here

In file 3-Companies.txt

née dresses
Zoë’s food store
móvíl résumés
Señor
big_smile
  • Where is the closing single quote from the awk commands portion? That is, the quote before the $2? If that's all of the script, you need another quote after the {next}: {next}' – mpez0 May 13 '23 at 14:58
  • Please read [How to create a Minimal, Complete, and Verifiable Example](https://stackoverflow.com/help/mcve). A reproducible example will allow us to try it. Please edit your original post. No comments. – Gilles Quénot May 13 '23 at 15:01
  • Your code works perfectly on MacOS 13.3.1 inside of Coderunner and at the command line in zsh. I don't think your issue is awk related. It is likely some sort of shell quoting issue. Are you running that pipe from some sort of txt script file? Or are you just pasting into a shell? How you are executing this is likely your issue. – dawg May 13 '23 at 15:32
  • What version of `awk` are you using (do `awk --version` to reveal that)? – Daweo May 13 '23 at 15:35
  • @markp-fuso @dawg Replacing `\r` with `\x0d` fixed it, although it won't work if there are smart quotes. Is there any way to make it work with smart quotes? If not, then @markp-fuso please submit the `\x0d` replacement as the solution so I can mark it as the correct answer. – big_smile May 13 '23 at 16:24
  • I'm curious, does the command `echo | awk 'sub(/\r/,"")'` fail too? – Fravadona May 13 '23 at 18:18
  • There is no part of shell programming that considers "smart quotes" as quotes. Quotes are either single `'` or double `"`, see https://mywiki.wooledge.org/Quotes, and if you use anything else then you're simply not using quotes and so inviting problems. Don't change a script to work around issues with smart quotes, use single (or double in some cases) quotes and **do not use "smart quotes"**. – Ed Morton May 14 '23 at 12:12
  • To be clear - using "smart quotes" is like not using quotes at all so your script is exposed to the shell for interpretation (which is a very, very bad thing) so your shell is converting `\r` to a literal Carriage Return character before awk sees it, hence awk complaining that there's a newline in the middle of your script. I expect you could reproduce this using `echo | awk sub(/\r/,"")` with no quotes around the script. – Ed Morton May 14 '23 at 12:34
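
One way to act on the advice about quotes is to check the script text for typographic quotes before running it. The following is a minimal sketch, not something from the thread: `find_smart_quotes` is a hypothetical helper name, and it assumes the command has been saved to a UTF-8 text file.

```shell
# Hypothetical helper: list the lines of a file that contain typographic
# ("smart") single or double quotes.
# -F matches the quote characters literally; -n prefixes each matching
# line with its line number.
find_smart_quotes() {
    grep -nF -e '‘' -e '’' -e '“' -e '”' -- "$1"
}
```

Running `find_smart_quotes split.sh` (the file name is an assumption) prints nothing when only straight quotes are present. Any line it does print needs its quotes retyped as `'` or `"`.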

2 Answers


The solution I posted nearly 3 years ago still works:

# the files produced must not exist prior to the run
awk -F, 'FNR>1 && $2 {print $1 >> ($2 ".txt"); close($2 ".txt")}' file.csv

Produces:

$ head *.txt
==> 2-Bathroom.txt <==
toilet handle
bath room

==> 2-Police.txt <==
innocent
crime "doesnt sleep"

==> 2-Sense of Smell.txt <==
odorless
musty smells come here

==> 3-Companies.txt <==
née dresses
Zoë’s food store
móvíl résumés
Señor

==> 3-Entrepreneurship.txt <==
innovation "is essential"
old ideas "new takes"
new income streams
good publicity guru

Or, here is a Ruby version:

ruby -r csv -e '
CSV.parse($<.read, **{:headers=>true, :liberal_parsing=>true}).
    select{|r| r["COLUMN 2"]}.
    group_by{|r| r["COLUMN 2"]}.
    each{|k,v| File.write("#{k}.txt", v.map(&:first).map(&:last).join("\n")) 
}
' file.csv
# same output
dawg
  • Thanks, that works but it puts question marks in the file name before the extension. Is there a way to prevent that? Thanks. – big_smile May 13 '23 at 16:35
  • Those would be the `\r`s that this script is not removing from the end of each line before producing output. Your existing script is using `{sub(/\r$/,"")}` to do that. See [why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it](https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it) for more info on those `\r`s. – Ed Morton May 14 '23 at 12:22
  • In addition to not removing the `\r`s the other functional differences between this script and the one you're using are that this one would not create an output file if `$2` contained `0` or `0.00` or similar while your existing one would create it, and this one will close and re-open the output file for every line being produced whether the output file name changes or not while your existing one will leave the output file open as long as $2 doesn't change, and this one will append to any existing files with the same name while your existing script will overwrite them every time it's called. – Ed Morton May 14 '23 at 12:27
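
The differences listed in these comments can be folded into a single awk call. What follows is a rough sketch rather than either author's script: the function name `split_by_second_column` and the `seen` array are illustrative, and it assumes, like the commands above, that no field contains a comma inside quotes. It skips each file's header, strips a trailing `\r`, skips rows whose second column is blank, overwrites an output file the first time it is used in a run, and appends after that.

```shell
# Hypothetical helper: split CSV rows into "<column 2>.txt" files,
# writing only column 1. Output files are created in the current directory.
split_by_second_column() {
    awk -F',' '
        FNR == 1 { next }                  # skip the header line of each input file
        { sub(/\r$/, "") }                 # drop a trailing carriage return (CRLF input)
        $2 ~ /^[[:space:]]*$/ { next }     # skip rows with an empty second column
        {
            out = $2 ".txt"
            if (out in seen) {
                print $1 >> out            # later rows: append
            } else {
                print $1 > out             # first row for this name: truncate or create
                seen[out] = 1
            }
            close(out)                     # keep the number of open files at one
        }
    ' "$@"
}
```

It would be called as `split_by_second_column *.csv`. Closing the file after every row avoids the need to sort the input first, at the cost of reopening the file for each line, as noted in the comment above.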

Looks like it's treating the \r as a literal linefeed (possible issue with using smart quotes?).

You might try, say, replacing \r with \x0d to see if that makes a difference.
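
If the `\r` escape really is being altered before awk sees it, another way to keep backslash escapes out of the awk program text is to pass the carriage return in as a variable. This is only a sketch under that assumption; `strip_trailing_cr` and the variable name `cr` are hypothetical, and the octal escape `\015` is handled by `printf`, not by awk.

```shell
# Hypothetical helper: remove one trailing carriage return from each line.
# The CR character is produced by printf (octal 015) and handed to awk
# with -v, so the awk program itself contains no backslash escapes.
strip_trailing_cr() {
    cr=$(printf '\015')
    awk -v cr="$cr" '{ sub(cr "$", ""); print }' "$@"
}
```

For example, `strip_trailing_cr file.csv` writes the cleaned lines to standard output, where they could be piped into the rest of the command.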

markp-fuso
  • This is trying to work around using "smart quotes" instead of single quotes `'` around the script - don't do that, fix the quotes instead and leave `\r` as-is. – Ed Morton May 14 '23 at 12:30