I'm parsing a .csv file. It contains some cells with text like:

,"Some words, some more words, an so on",

So splitting on the , delimiter doesn't work correctly. The only solution I see is a regex pattern that matches the quoted string, so that the commas inside " " can be replaced with some rarely used symbol combination (like '___') and then restored to the original after the script finishes its job.

Something like echo ${var//in/out}

But I'm not strong in regular expressions, and maybe I'm missing a more obvious solution.

Any help appreciated.

Wiktor Stribiżew
Andrew
  • Parsing a quoted CSV with standard tools is tricky, and when a quoted field contains a literal newline character it's even more difficult. There exist external tools for manipulating CSV files that can make your life easier; you should try Miller https://miller.readthedocs.io/en/latest/ or CSVtools https://github.com/DavyLandman/csvtools – Fravadona Jun 06 '22 at 12:57
  • [RFC 4180](https://datatracker.ietf.org/doc/html/rfc4180#section-2) defines the CSV format and says to use quotes if the field contains the separator character, like the commas here. Try to get a CSV library; regex is the wrong tool to parse CSV. – Robert Jun 06 '22 at 15:36
  • You mention `And to replace back to the original after script finish it's job.` - depending on what the job is, it's very likely you don't need to do any replacement at all. Post a new question if you'd like help with whatever it is you're trying to do. – Ed Morton Jun 06 '22 at 18:11

2 Answers

For the conversion you're trying to do, all you need is:

$ awk 'BEGIN{FS=OFS="\""} {for (i=2;i<=NF;i+=2) gsub(/ *, */," ",$i)} 1' file
,"Some words some more words an so on",

For anything more interesting, see "What's the most robust way to efficiently parse CSV using awk?" for how to use awk on CSVs.
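The same even-field trick can also carry out the placeholder round trip described in the question while touching only the quoted fields. This is a minimal sketch, assuming no field contains an escaped quote (`""`) or an embedded newline, and that the placeholder `___` does not already occur inside a quoted field. The function names `protect_commas` and `restore_commas` and the sample line are made up for illustration:

```shell
# Replace commas inside double-quoted fields with a placeholder.
# Splitting on '"' makes every even-numbered awk field the inside of a quoted CSV field.
protect_commas() {
  awk 'BEGIN{FS=OFS="\""} {for (i=2;i<=NF;i+=2) gsub(/,/,"___",$i)} 1'
}

# Undo the replacement, again only inside quoted fields.
restore_commas() {
  awk 'BEGIN{FS=OFS="\""} {for (i=2;i<=NF;i+=2) gsub(/___/,",",$i)} 1'
}

# Hypothetical sample line with one quoted field between two plain ones.
line='a,"Some words, some more words, an so on",b'
printf '%s\n' "$line" | protect_commas
printf '%s\n' "$line" | protect_commas | restore_commas
```

Between the two calls, the line can be split on `,` safely, because the only commas left are field separators.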

Ed Morton

I found a solution myself.

First, replace every occurrence of ", " (a comma followed by a space) with ___ in the initial file:

sed -e 's|, |___|g' ./initial_file.csv

At the end, replace every ___ back with the original ", ":

sed -i 's|___|, |g' ./final_file.csv

Not so tricky :)

Andrew
  • Well, that would work when the commas in your quoted fields are **always** followed by a space **and** there's no other field that starts with a space. No one but you can guess that your CSV satisfies both conditions – Fravadona Jun 06 '22 at 13:58
  • That's clear. I didn't try to solve some common issue, but just a small task. – Andrew Jun 06 '22 at 14:52
  • That would convert any `___` that exists in the input to `, `s – Ed Morton Jun 06 '22 at 16:13
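
Two of the caveats raised in these comments can be reproduced on a small input. This is only a sketch; the sample line is invented for illustration and combines a field that already contains `___` with a quoted field whose first comma is not followed by a space:

```shell
# Hypothetical input: an unquoted field containing ___ and a quoted field with a bare comma.
line='id___7,"one,two, three",end'

# Forward pass from the answer: only ", " is replaced, so "one,two" keeps its comma.
protected=$(printf '%s\n' "$line" | sed -e 's|, |___|g')
printf '%s\n' "$protected"

# Backward pass: the ___ that was already in the data becomes ", " as well.
restored=$(printf '%s\n' "$protected" | sed -e 's|___|, |g')
printf '%s\n' "$restored"
```

The restored line differs from the input: `id___7` has become `id, 7`, and the comma in `one,two` was never protected, so splitting the intermediate line on `,` would still break that field.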