
I have a .csv with 470 columns and tens of thousands of rows of products. Many fields contain text strings with embedded commas, which make my awk statements split in the wrong places and write to the wrong columns, corrupting my data. Here is the pipeline I'm using:

Input example:

LDB0010-300,"TIMELESS DESIGN: Classic, Elegant, Beautiful",Yes,1,Live,...
LDB0010-400,"CLASSIC DESIGN: Contemporary look",No,0,Not Live,...
LDB0010-500,"Everyone should wear this, almost!",Yes,0,Not Live,...

Code:

cat products.csv | sed -e 's/, /@/g' | awk -F, 'NR>1 {$308="LIVE" ; $310="LIVE" ; $467=0 ; print $0}' OFS=, | sed -e 's/@/, /g'

Current output, which is wrong with data written in the wrong columns:

LDB0010-300,"TIMELESS DESIGN: Classic",LIVE, Beautiful",Yes,1,Live,...
LDB0010-400,"CLASSIC DESIGN: Contemporary look",No,0,0,...
LDB0010-500,"Everyone should wear this",LIVE,Yes,0,Not Live,...

When studying the data more closely, I noticed that in the cells with text descriptions, commas were always followed by a space, whereas commas used as delimiters had no space after them. So the approach I took was to substitute comma-space with '@', run my awk statement to set the values of those columns, then substitute '@' back to comma-space. This all looked pretty good until I opened the spreadsheet and noticed that there were many rows with values written into the wrong columns. Does anyone know a better way to do this that will prevent these blowouts?
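For example, here is a hypothetical row (not from my real data) showing how the substitution can miss a field-internal comma that is not followed by a space:

```python
import re

line = 'SKU-1,"Pack of 1,000 units",Yes'  # hypothetical row, 3 real fields
masked = re.sub(', ', '@', line)  # same substitution as the sed step

# "1,000" has no space after the comma, so nothing was masked...
assert masked == line
# ...and a naive comma split still breaks inside the quoted field:
assert len(masked.split(',')) == 4  # awk -F, would see 4 fields, not 3
print(masked.split(','))
```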

tripleee
Mike
  • [this](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) should point you in the right direction – vgersh99 Sep 03 '20 at 16:12
  • Does this answer your question? [What's the most robust way to efficiently parse CSV using awk?](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) – thanasisp Sep 03 '20 at 16:15
  • If your CSV isn't actually valid CSV, hard to say. If it is, use a language with a proper CSV parser; many people find Python's CSV support versatile and robust, and the language itself easy to learn enough to replace a couple of Awk statements. – tripleee Sep 03 '20 at 16:15
  • If you're interested, I have an awk clone (using tcl instead) that supports proper splitting of csv data. – Shawn Sep 03 '20 at 17:37
  • Maybe see also https://stackoverflow.com/questions/1560393/bash-shell-scripting-csv-parsing – tripleee Sep 03 '20 at 18:03
  • @Shawn yeah interested. – Mike Sep 03 '20 at 18:15
  • Lives at https://github.com/shawnw/tawk – Shawn Sep 03 '20 at 19:27
  • @Shawn `tawk` is the name of an awk variant (like gawk, mawk, and nawk), see http://www.tasoft.com/tawk.html, so using the name `tawk` for your software might cause some confusion. There are gawk extension libraries (see http://gawkextlib.sourceforge.net/) to handle various formats (e.g. json and xml) and I expect the CSV one (http://gawkextlib.sourceforge.net/csv/csv.html) will be available soon-ish for anyone wanting to install an external package. – Ed Morton Sep 03 '20 at 23:07

1 Answer


The sample data you posted does not reproduce the symptoms you report with the code you provided. The simplest explanation is that your observation (that commas followed by a space are always field-internal, and other commas never are) is in fact incorrect. This should be easy enough to check:

sed 's/, /@/g' products.csv | awk -F, '{ a[NF]++ } END { for (n in a) print n, a[n] }'

If you don't get a single line of output, pairing the expected number of columns with the total number of rows, then your sed trick is not masking every field-internal comma. (Notice also the fix for the useless cat.)
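For comparison, a proper CSV parser gives you the same per-row field count without any masking at all; here is a sketch with an in-memory sample standing in for products.csv:

```python
import csv
import io
from collections import Counter

# In-memory stand-in for products.csv; the quoted fields contain commas.
sample = (
    'sku,description,flag\n'
    'LDB0010-300,"TIMELESS DESIGN: Classic, Elegant, Beautiful",Yes\n'
    'LDB0010-500,"Everyone should wear this, almost!",Yes\n'
)

counts = Counter(len(row) for row in csv.reader(io.StringIO(sample)))
print(counts)  # a single key means every row parsed to the same width
```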

Anyway, here is a simple Python refactoring which should hopefully be obvious enough. The Python CSV library knows how to handle quoted fields so it will only split on commas which are outside double quotes.

#!/usr/bin/env python3

import csv
import sys


w = csv.writer(sys.stdout)

for file in sys.argv[1:]:
    with open(file, newline='') as inputfile:
        r = csv.reader(inputfile)
        for row in r:
            row[307] = "LIVE"
            row[309] = "LIVE"
            row[466] = 0
            w.writerow(row)

Notice how Python's indexing is zero-based, whereas Awk counts fields starting at one.
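If you want to convince yourself of the off-by-one mapping, a quick check with dummy field names makes it concrete:

```python
# 470 fields labeled with their 1-based (awk-style) positions
row = ["col%d" % n for n in range(1, 471)]
assert row[307] == "col308"  # Python's row[307] is awk's $308
print(row[307])
```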

You'd run it like

python3 this_script.py products.csv

See also the Python csv module documentation for various options you might want to use.
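For instance, if you wanted every output field quoted (just an illustration; your data may not need this), you could pass a quoting option to the writer:

```python
import csv
import io

buf = io.StringIO()
w = csv.writer(buf, quoting=csv.QUOTE_ALL)  # quote every field, not just ones with commas
w.writerow(["LDB0010-300", "Classic, Elegant", "Yes"])
print(buf.getvalue())  # '"LDB0010-300","Classic, Elegant","Yes"\r\n'
```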

The above reads all the input files and writes the output to standard output. If you just want to read a single input file and write to a different file, that could be simplified to

#!/usr/bin/env python3

import csv
import sys


with open(sys.argv[1], newline='') as inputfile, open(sys.argv[2], 'w', newline='') as outputfile:
    r = csv.reader(inputfile)
    w = csv.writer(outputfile)
    header = True
    for row in r:
        if not header:  # don't muck with first line
            row[307] = "LIVE"
            row[309] = "LIVE"
            row[466] = 0
        w.writerow(row)
        header = False

You'd run this as

python3 thisscript.py input.csv output.csv

I absolutely hate specifying the output file as a command-line argument (we should have an option instead) but for a quick one-off, I guess this is acceptable.
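For what it's worth, here is a sketch of that option-based interface using argparse (same column indices as above; untested against your real data):

```python
#!/usr/bin/env python3

import argparse
import csv
import sys


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("infile", help="input CSV")
    parser.add_argument("-o", "--output", help="output file (default: standard output)")
    args = parser.parse_args(argv)

    out = open(args.output, "w", newline="") if args.output else sys.stdout
    with open(args.infile, newline="") as inputfile:
        r = csv.reader(inputfile)
        w = csv.writer(out)
        for i, row in enumerate(r):
            if i > 0:  # don't muck with the header line
                row[307] = "LIVE"
                row[309] = "LIVE"
                row[466] = 0
            w.writerow(row)
    if out is not sys.stdout:
        out.close()
```

With the usual `if __name__ == "__main__": main()` guard at the bottom, you'd run it as `python3 thisscript.py products.csv -o output.csv`, or leave off `-o` to print to standard output.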

tripleee
  • thanks... I know the data sample is just a sample; the actual sheet is hundreds of columns and tens of thousands of rows. It was just intended to show what I'm seeing. I don't know Python, but after taking a deep look at your Python solution, it seems the most elegant. Thanks for sharing and also showing how to call it. QUESTION: does it write the edited rows directly back into the original input file? Or does it output a different file name?... or rather, can it? – Mike Sep 03 '20 at 18:22
  • This one just prints to standard output (notice the `sys.stdout`), but it's easy enough to change the writer to `with open('newfile.csv', 'w') as outputfile:` and indent the rest of the script, adding `w = csv.writer(outputfile)` inside the `with` block. I updated the answer with a more detailed elaboration. – tripleee Sep 03 '20 at 18:37
  • It actually stops giving me errors when I create an output.csv file for it to go to... but it doesn't write anything... ```-rw-rw-r-- 1 admin admin 0 Sep 4 14:13 output.csv``` – Mike Sep 04 '20 at 19:17
  • Works for me (with indices adjusted): https://ideone.com/yTfzbt ... If you have lines which are not long enough you would get a traceback with an `IndexError` and an empty or incomplete output file. – tripleee Sep 04 '20 at 19:29
  • ok I see I had the last line wrong. It's working now. One more question, it overwrites the column headers. is there way to prevent that??? – Mike Sep 04 '20 at 20:05
  • Headers are a nuisance; but sure - see edit. – tripleee Sep 04 '20 at 20:16