2

I need to "extract" certain columns from a CSV file. The list of columns to extract is long and their indices do not follow a regular pattern. So far I've come up with a regular expression for a comma-separated value but I find it frustrating that in the RHS side of sed's substitute command I cannot reference more than 9 saved strings. Any ideas around this?

Note that comma-separated values that contain a comma must be quoted so that the comma is not mistaken for a field delimiter. I'd appreciate a solution that can handle such values properly. Also, you can assume that no value contains a new line character.

SJU
  • 187
  • 1
  • 7

4 Answers4

3

With GNU awk:

$ cat file
a,"b,c",d,e

$ awk -vFPAT='([^,]*)|("[^"]+")' '{print $2}' file
"b,c"
$ awk -vFPAT='([^,]*)|("[^"]+")' '{print $3}' file
d

$ cat file
a,"b,c",d,e,"f,g,h",i,j

$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, -vcols="1,5,7,2" 'BEGIN{n=split(cols,a,/,/)} {for (i=1;i<=n;i++) printf "%s%s", $(a[i]), (i<n?OFS:ORS)}' file
a,"f,g,h",j,"b,c"

See http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content for details. I doubt if it'd handle escaped double quotes embedded in a field, e.g. a,"b""c",d or a,"b\"c",d.

See also What's the most robust way to efficiently parse CSV using awk? for how to parse CSVs with awk in general.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • 1
    While I still think a fully fledged CSV parser would be the right tool for that, this answer is great! Never knew about `FPAT` – hek2mgl Apr 17 '15 at 21:44
  • 1
    FPAT is new in gawk 4.*. If you don't have it you can do the same thing with a loop and match()+substr(). – Ed Morton Apr 17 '15 at 21:45
  • It would get a little more complicated if the input would be `a,"b,\"c\"",d,e,"f,g,h",i,j`. The regex should look like [this](http://stackoverflow.com/a/2039820/171318) . Also still the solution supports the `,` as the delimiter only. What if you support multiple delimiters and the input would look like `a,b;c` ? Imo, a stable solution needs the delimiter as input argument, it cannot guess it. I guess this can be even proved by math. – hek2mgl Apr 17 '15 at 21:54
  • 1
    Right, I mentioned at the end of the answer that it probably couldn't support embedded escaped quotes. To be robust, you'd also need a copy of the "standard" that defines the "CSV" language you are using. Trying to write a general CSV parser is like trying to write an Algol-like-language parser. Its a losing battle if you don't have a BNF for the SPECIFIC Algol-like-language you are trying to parse (C, awk, Java, pascal, Ada,....). IMHO if you have a "CSV" file that you KNOW has one delimiter and no embedded quotes or newlines, may as well keep the solution cheap and cheerful too. – Ed Morton Apr 17 '15 at 22:03
  • 1
    Oh, I've overseen that you already mentioned that problems can arise if there are embedded double quotes in the data! – hek2mgl Apr 17 '15 at 22:10
  • Regarding escape quotes: Maybe replacing all \" with say "__QUOTE__" prior to awk - ```cat source.csv | | sed 's/\\"/__QUOTE__/' | awk ... | sed 's/__QUOTE__/\\"/'``` Can be also done with gsub() inside AWK... – elixon Jun 17 '19 at 18:45
  • @elixon yeah there's lots of options. See https://stackoverflow.com/q/45420535/1745001 for how to parse CSVs with awk in general and https://stackoverflow.com/a/35708616/1745001 for how to create/use temporary strings that can't be present in the input (if/when using a simple `\n` isn't an option within awk) rather than hoping `__QUOTE__` or any other string doesn't exist in the input. – Ed Morton Jun 17 '19 at 19:03
1

CSV is not that easy to parse like it might look in the first place. This is because there can be a plenty of different delimiters or fixed column widths to separate the data, and also the data may contain the delimiter itself (escaped).

Like I already told here I would use a programming language which supports a CVS library for that.

Use

Community
  • 1
  • 1
hek2mgl
  • 152,036
  • 28
  • 249
  • 266
0

Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
I provided sample code within my answer here: parse csv file using gawk

Community
  • 1
  • 1
Chris Koknat
  • 3,305
  • 2
  • 29
  • 30
0

There is command-line csvtool available - https://colin.maudry.com/csvtool-manual-page/

# apt-get install csvtool

elixon
  • 1,123
  • 12
  • 15