
I have a CSV with 20000 columns. Here is a subset of it:

"eid","20216-2.0","20216-3.0","20220-2.0","20220-3.0"
"1548197","1","hello","","2020-03-05"
"2101984","2","string","","2020-03-04"
"2986696","3","no","","2020-04-05"
"1543304","3","ge","","2020-02-10"
"3207207","3","no","","2020-03-20"
"2373538","4","yesterday","","2020-03-01"
"4930973","5","today","","2020-03-06"
"6012673","54","tomorrow","","2020-05-05"
"4978627","1","yes","","2020-03-10"

I want to use awk to extract 2 of the columns:

awk -F "," '{ print $1, $3 }' input.csv  > output.csv

When I check my output.csv file, the results are messed up, like this:

"eid","20216-3.0"
"1548197","2020-03-05"
"2","string"
"no",""
"1543304",""
"","2020-03-20"
"yesterday",""
"4930973","2020-03-06"
"tomorrow","2020-05-05"
"4978627","2020-03-10"

Can someone please help me?

wwla
  • Messing up how? The sample you show certainly doesn't have 6713 columns, so maybe that's the problem, but we really can't tell what results you expect or how to fix this. – tripleee Jul 30 '20 at 17:52
  • the csv I showed was just to show the separator. the results are not homogeneous. it is like it took different columns for each row, like it works for the first row (column) – wwla Jul 30 '20 at 18:06
  • I'm afraid you'll really need to [edit] to clarify what you are getting now, and what you want to get instead. See also the [help] and in particular [How to ask](/help/how-to-ask) as well as the guidance for providing a [mre]. – tripleee Jul 30 '20 at 18:08
  • okay, I am sorry if this wasn't clear and thanks for your help. I will edit right away. – wwla Jul 30 '20 at 18:18
  • Is it possible that some of your columns contain commas within the double quotes? Like `"Smith, John"` or `"City, State"`? – John1024 Jul 30 '20 at 18:21
  • I am not sure, but probably no! if I use pandas df = pd.read_csv('input.csv'), it works fine – wwla Jul 30 '20 at 18:26
  • As mentioned above, string fields in CSV files can contain embedded quotes. To handle these properly, try one of the solutions from [What's the most robust way to efficiently parse CSV using awk?](https://stackoverflow.com/q/45420535/3744182). – dbc Jul 30 '20 at 18:34
  • Pandas knows how to cope with quoted fields, Awk simply splits on every comma, like you literally asked it to. Another possible source of erratic behavior is DOS carriage returns, where again Python / Pandas is more resilient / less Unix-y. But as posted, I don't think this is reproducible. – tripleee Jul 30 '20 at 18:36
  • This is so overwhelming. What is the best way to understand awk commands? I'm sorry for asking too many questions and being so lost – wwla Jul 30 '20 at 18:39
  • In that case, `-F ","` is likely the cause of the problem. For the solution see the link from @dbc above or [AWK set multiple delimiters for comma and quotes with commas](https://stackoverflow.com/questions/31083953/awk-set-multiple-delimiters-for-comma-and-quotes-with-commas) – John1024 Jul 30 '20 at 18:39
  • Awk is pretty simple. It is just your bad luck that you started by dealing with CSV, which is one of the most complex formats to handle in awk. – John1024 Jul 30 '20 at 18:41
  • Honestly `awk` is not the best tool for this job. If `pd.read_csv('input.csv')` is working then find a way to invoke that from your shell script, and move on, e.g. [Shell Script: Execute a python program from within a shell script](https://stackoverflow.com/q/4377109/3744182). – dbc Jul 30 '20 at 18:43
  • I recently heard about awk and I believe that it is very powerful. But for someone who knows nothing about it, I think it is hard; maybe if I knew the logic behind it, it would be easier. My csv's size is 30 GB, so using Python is impossible; that's how I stumbled upon awk – wwla Jul 30 '20 at 18:46
  • There are standalone CSV-aware tools too if you prefer. You don't really need Pandas; the Python standard `csv` module handles this fine, too. It can definitely handle CSV files of any size if you process them line by line (unless a single line is multiple gigabytes, perhaps). – tripleee Jul 30 '20 at 18:46
  • A quick search turns up [python read bigger csv line by line](https://stackoverflow.com/q/51097057/3744182). – dbc Jul 30 '20 at 18:48
  • are you on Windows by any chance? (you didn't include either your OS or your awk version). If so, you might consider saving your awk in a file and then running it as: ```awk -f myAwkScript.awk myInputFile.csv``` (including your FS and OFS definitions in the BEGIN block: ```FS=OFS=","```) – vgersh99 Jul 30 '20 at 19:32
  • Your example works perfectly for me, using GNU Awk 5.0.0, API: 2.0 on Windows 10. – Terry Ebdon Jul 30 '20 at 19:40
  • the example was oversimplified. the csv has 10000 columns and 500000 rows – wwla Jul 30 '20 at 20:06
  • I am using CentOS Linux and GNU Awk 4.1.3, API: 1.1 – wwla Jul 30 '20 at 20:08
  • is there a way to use pandas.read_csv and, instead of using chunksize for rows, use something similar for columns? – wwla Jul 30 '20 at 20:08
  • *My csv's size is 30 GB so using python is impossible* -- not true. Python can deal with files line-by-line in the same manner as awk. What may be true is you cannot hold the entire CSV file in memory as Pandas would do. You would use the csv module instead. – dawg Jul 30 '20 at 21:11
  • The issue with your example isn't that it's simplified, it's that it doesn't reproduce the problem. Do divide and conquer on your data (or whatever other approach you like) to come up with a CSV that's about 6 lines long with about 6 fields on each line **and still reproduces your issue** (i.e. a [mcve]) when you run your awk script on it and **then** we'll be able to help you. – Ed Morton Jul 30 '20 at 21:36
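The failure mode the commenters suspect can be reproduced with a tiny made-up file (not data from the question): a quoted field that contains a comma is split by `-F ','` into two fields, shifting every later column.

```shell
# Hypothetical two-column CSV whose second field holds an embedded comma.
printf '%s\n' '"eid","name"' '"1548197","Smith, John"' > demo.csv

# Naive comma splitting breaks the quoted field in two,
# so the data row reports 3 fields instead of 2.
awk -F ',' '{ print NR": "NF" fields" }' demo.csv
# 1: 2 fields
# 2: 3 fields
```

Once one row has an extra field, `$3` no longer refers to the same column on every row, which would produce exactly the kind of shifted output shown in the question.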

1 Answer

awk -v FPAT="([^,]+)|(\"[^\"]+\")" '{ print $1, $6713 }' input.csv  > output.csv

Worked! thanks all
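As a hedged illustration of why this works (FPAT is a GNU awk extension, and the sample line here is invented): FPAT describes what a *field* looks like rather than what a separator looks like, so a double-quoted string is kept as one field even when it contains commas.

```shell
# A quoted field containing a comma, which defeats -F ','.
printf '%s\n' '"1548197","Smith, John","2020-03-05"' > demo.csv

# A field is either a run of non-commas, or a double-quoted
# string (which may contain commas); the longest match wins.
awk -v FPAT='([^,]+)|("[^"]+")' '{ print $2 }' demo.csv
# With GNU awk this prints the whole quoted field: "Smith, John"
```

With plain `-F ','` the same line would split the name into two fields and `$2` would be `"Smith`.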

wwla
  • You're having to escape double quotes in your FPAT because you used the wrong type of quotes around it. Use `FPAT='([^,]+)|("[^"]+")'`. You also don't need the parens but they arguably improve clarity and don't hurt. That FPAT will fail if you have unquoted empty fields though so you'd be better off with `*` than `+` in the first part of it and it'll fail if you have nested double quotes. All in all, I'd suggest you use `FPAT='[^,]*|("[^"]*")+'` – Ed Morton Jul 30 '20 at 22:51
  • Thank you @EdMorton – wwla Aug 02 '20 at 04:20
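Ed Morton's refined pattern can be checked the same way (again GNU awk specific, sample data invented): `[^,]*` also matches an empty unquoted field, and the trailing `+` on the quoted alternative tolerates doubled quotes inside a field.

```shell
# Empty middle field plus a quoted field with an embedded comma.
printf '%s\n' '"1",,"Smith, John"' |
awk -v FPAT='[^,]*|("[^"]*")+' '{ print NF; print $NF }'
# With a recent GNU awk (4.2+) the empty field is counted, so this
# prints 3, then the intact last field: "Smith, John"
```

The original answer's `+`-only pattern would silently drop the empty field, renumbering every column after it.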