
Unix noob here again. I'm writing a script for a Unix class I'm taking online, and it is supposed to handle a csv file while excluding some info and tallying other info, then print a report when finished. I've managed to write something (with help) but am having to debug it on my own, and it's not going well.

After sorting out my syntax errors, the script runs; however, all it does is print the header before the loop at the end, plus one single value of zero under "Gross Sales", where it doesn't belong.

My script thus far: I am open to any and all suggestions. However, I should mention again, I don't know what I'm doing, so you may have to explain a bit.

#!/bin/bash

grep -v '^C' online_retail.csv \
| awk -F\\t '!($2=="POST" || $8=="Australia")' \
| awk -F\\t '!($2=="D" || $3=="Discount")' \
| awk -F\\t '{(country[$8] += $4*$6) && (($6 < 2) ? quantity[$8] += $4 : quantity[$8] += 0)} \
END{print "County\t\tGross Sales\t\tItems Under $2.00\n \
-----------------------------------------------------------"; \
for (i in country) print i,"\t\t",country[i],"\t\t", quantity[i]}'

THE ASSIGNMENT Summary:

Using only awk and sed (or grep for the clean-up portion) write a script that prepares the following reports on the above retail dataset.

First, do some data "clean up" -- you can use awk, sed and/or grep for these:

  1. Invoice numbers with a 'C' at the beginning are cancellations. These are just noise -- all lines like this should be deleted.
  2. Any items with a StockCode of "POST" should be deleted.
  3. Your Australian site has been producing some bad data. Delete lines where the "Country" is set to "Australia".
  4. Delete any rows labeled 'Discount' in the Description (or that have a 'D' in the StockCode field). Note: if you already completed steps 1-3 above, you've probably already deleted these lines, but double-check here just in case.

Then, print a summary report for each region in a formatted table. Use awk for this part. The table should include:

  - Gross sales for each region (printed in any order). To calculate gross sales, multiply the UnitPrice times the Quantity for each row, and keep a running total. The regions in the file, less Australia, are: France, United Kingdom, Netherlands, Germany, Norway.
  - Items under $2.00 are expected to be a big thing this holiday season, so include a total count of those items per region.

Use field widths so the columns are aligned in the output. The output table should look like this, although the data will produce different results. You can format the table any way you choose, but it should be a readable table with aligned columns. Like so:

Country             Gross Sales         Items Under $2.00
---------------------------------------------------------
France              801.86              12
Netherlands         177.60              1
United Kingdom      23144.4             488
Germany             243.48              11
Norway              1919.14             56

A small sample of the csv file:

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536388,22469,HEART OF WICKER SMALL,12,12/1/2010 9:59,1.65,16250,United Kingdom
536388,22242,5 HOOK HANGER MAGIC TOADSTOOL,12,12/1/2010 9:59,1.65,16250,United Kingdom
C536379,D,Discount,-1,12/1/2010 9:41,27.5,14527,United Kingdom
536389,22941,CHRISTMAS LIGHTS 10 REINDEER,6,12/1/2010 10:03,8.5,12431,Australia
536527,22809,SET OF 6 T-LIGHTS SANTA,6,12/1/2010 13:04,2.95,12662,Germany
536527,84347,ROTATING SILVER ANGELS T-LIGHT HLDR,6,12/1/2010 13:04,2.55,12662,Germany
536532,22962,JAM JAR WITH PINK LID,12,12/1/2010 13:24,0.85,12433,Norway
536532,22961,JAM MAKING SET PRINTED,24,12/1/2010 13:24,1.45,12433,Norway
536532,84375,SET OF 20 KIDS COOKIE CUTTERS,24,12/1/2010 13:24,2.1,12433,Norway
536403,POST,POSTAGE,1,12/1/2010 11:27,15,12791,Netherlands
536378,84997C,BLUE 3 PIECE POLKADOT CUTLERY SET,6,12/1/2010 9:37,3.75,14688,United Kingdom
536378,21094,SET/6 RED SPOTTY PAPER PLATES,12,12/1/2010 9:37,0.85,14688,United Kingdom

Seriously, thank you to whoever can help. You all are amazing!

edit: I think I used the wrong field separator... but not sure how it is supposed to look. still tinkering...

edit2: Okay, I "fixed?" the delimiter and changed it from awk -F\\t to awk -F\\',' and now it runs; however, the report data is all incorrect. Sigh... I will trudge on.

DeweyWoodz
  • See [How do I ask and answer homework questions?](https://meta.stackoverflow.com/a/334823/14122) on [meta]. A question should be about a specific technical problem you encountered when doing your assignment, with everything not needed to demonstrate or test fixes to _that specific problem_ factored out. A request for general suggestions is too broad to be on topic; focus in on a specific problem (after trying to find other answers to that problem elsewhere in the knowledge base). – Charles Duffy Apr 19 '21 at 02:11
  • Okay, I will do that. Sorry for being vague, I really did try to come with specifics. – DeweyWoodz Apr 19 '21 at 02:17
  • I think the _amount_ of specifics (background, assignment details, etc) is part of what made it hard to focus in on exactly one problem. Building a [mre] that isolates only the most immediate issue (like "commas not being correctly used for field splitting") and removes everything unrelated to that issue makes for a question that's much easier to follow, and likewise easier for someone else with the same problem to successfully search for and learn from. – Charles Duffy Apr 19 '21 at 12:31

1 Answer


Your professor is hinting at what you should use to do this: awk, and awk alone, in a single call processing all records in your file. You can do it with three rules.

  1. The first rule sets the conditions which, if found in the record (line), cause awk to skip to the next record, ignoring it. According to the description, that would be:
    FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next }
  2. The second rule sums the Quantity * UnitPrice for each Country and also keeps track of low-price goods where the UnitPrice < 2. The a[] array tracks the total sales per Country, and the lpc[] array tracks the quantity of items sold with a UnitPrice under $2.00. That can simply be:
    {
        a[$NF] += $4 * $6
        if ($6 < 2)
            lpc[$NF] += $4
    }
  3. The final END rule prints the heading and then outputs the table in column form. That could be:
    END {
        printf "%-20s%-20s%s\n", "Country", "Gross Sales", "Items Under $2.00"
        for (i = 0; i < 57; i++)
            printf "-"
        print ""
        for (i in a)
            printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
    }

That's it. If you put it all together, providing your input in the file `file`, you would have:

awk -F, '
    FNR==1 || /^C/ || $2=="POST" || $NF=="Australia" || $2=="D" { next } 
    {
        a[$NF] += $4 * $6
        if ($6 < 2)
            lpc[$NF] += $4
    }
    END {
        printf "%-20s%-20s%s\n", "Country", "Gross Sales", "Items Under $2.00"
        for (i = 0; i < 57; i++)
            printf "-"
        print ""
        for (i in a)
            printf "%-20s%10.2f%10s%s\n", i, a[i], " ", lpc[i]
    }
' file

Example Use/Output

You can just select-copy and middle-mouse-paste the command above into an xterm whose current working directory contains `file`. Doing so, your results would be:

Country             Gross Sales         Items Under $2.00
---------------------------------------------------------
United Kingdom           72.30          36
Germany                  33.00
Norway                   95.40          36

Which is similar to the format specified -- though I took the time to decimal-align the Gross Sales column for easy reading.
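The decimal alignment comes from the printf conversions: `%-20s` left-justifies the country name in a 20-character field, and `%10.2f` right-justifies the number in a 10-character field with 2 decimal places. A standalone illustration (the values here are made up, not taken from the dataset):

```shell
# %-20s: left-justify the string in a 20-char field
# %10.2f: right-justify the number in 10 chars with 2 decimals
awk 'BEGIN { printf "%-20s%10.2f\n", "Norway", 95.4 }'
# prints: Norway                   95.40
```

Widening or narrowing those field widths is all it takes to re-space the whole table, since every row goes through the same format string.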

Look things over and let me know if you have further questions.

Note: processing CSV files with awk (or sed or grep) is generally not a good idea if the values can contain embedded commas within double-quoted fields, e.g.

field1,"field2,part_a,part_b",fields3,...

Embedded separators like this make it impossible to choose a single field-separator character that will correctly parse the file.

If your input does not have embedded commas (or separators) in the fields, awk is perfectly fine. Just be aware of the potential gotcha depending on the data.
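If you do run into quoted fields and have GNU awk available, its FPAT variable lets you describe what a field *is* rather than what separates fields. A minimal sketch (the sample line is made up, and this is gawk-only, not POSIX awk):

```shell
# gawk-only: FPAT defines a field as either a run of non-comma
# characters or a double-quoted string, so embedded commas stay
# inside their quoted field instead of splitting it.
echo 'field1,"field2,part_a,part_b",field3' |
    gawk -v FPAT='[^,]*|"[^"]*"' '{ print NF; print $2 }'
# prints: 3
#         "field2,part_a,part_b"
```

Note that the quoted field is returned with its surrounding double quotes intact; stripping them is an extra step if you need the bare value.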

David C. Rankin
  • A great post that discussed the problems parsing CSV files with embedded separators in fields (and offers good solutions) is [What's the most robust way to efficiently parse CSV using awk?](https://stackoverflow.com/q/45420535/3422102) – David C. Rankin Apr 19 '21 at 03:33