
I need to split a string into an array in bash, not by a fixed delimiter character but by a regular expression. Here are the details:

I have a CSV file containing millions of lines of the following format:

123,"usrlogin1","usrname2","companyname1","email@example.com","1970-01-01 00:00:00","USR",0,1

However, some fields may contain escaped quotation marks \" and commas within strings, so a line might just as well look like this:

123,"usrlogin1","usrname2","The \"awesome\" company, NA","email@example.com","1970-01-01 00:00:00","USR",0,1

I need to split each line of this file at the correct commas, that is to say those that lie outside strings delimited by non-escaped quotation marks.

Desired output:

123
usrlogin1
usrname2
The "awesome" company, NA
email@example.com
1970-01-01 00:00:00
USR
0
1

This is what I have so far:

#!/bin/sh

#read file line by line
while IFS='' read -r line; do

    #replace escaped quotation marks (\") with (&quot;)
    line=$(echo $line | sed 's/\\"/\&quot\;/g')

    #split this line now by commas
    #TODO


done < "$1"

At the point of TODO, I would like to use the following regex as delimiter for splitting my string into an array:

(?!\B"[^"]*),(?![^"]*"\B)

Demo: https://regex101.com/r/xB7rQ7/1

How can I split the $line by the regex above?

Matthias Bö
  • You might consider a real CSV parser instead. [bash: Parse CSV with quotes, commas and newlines](https://stackoverflow.com/questions/36287982/bash-parse-csv-with-quotes-commas-and-newlines) is a good place to start. – Charles Duffy Jun 15 '18 at 15:14
  • ...if you want to run a regex natively in bash, it has to be valid ERE, which the one you gave isn't (ERE has no `(?!` syntax and no `\B`s, for example; there's a valid POSIX ERE equivalent for many backslash-whatever shortcuts, but I don't know what `\B` is offhand to say). – Charles Duffy Jun 15 '18 at 15:15
  • BTW, `echo $line | ...` is buggy -- let's say your CSV contains `"My * AWESOME * company"` as a field; the `*`s will be replaced with filenames in the directory where you're running your code unless you make it `echo "$line"`. See [BashPitfalls #14](http://mywiki.wooledge.org/BashPitfalls#echo_.24foo) – Charles Duffy Jun 15 '18 at 15:19
  • Your regex makes no sense. `[^"]*` at the end of a look-ahead group is pointless because it always matches (it can match the empty string). `\B` is also suspect: `"` is a quoting character no matter whether it lies on a non-word boundary. – melpomene Jun 15 '18 at 15:30
  • Your `\"` replacement scheme isn't right for two reasons: 1. Afterwards you can't distinguish `\"` from a field that actually contains `&quot;` initially. 2. If you have a field that contains actual backslashes, they ought to be escaped as `\\`, meaning `"...\\"` (a literal backslash at the end of a quoted field) would turn into `"...\&quot;`. – melpomene Jun 15 '18 at 15:33
  • "a file containing millions of lines" -- I would immediately look for a non-bash solution based on this statement. "a CSV file" -- doubly so. – glenn jackman Jun 15 '18 at 15:35
  • Thank you for all the comments. I will scrap my current approach and work with a proper parser instead. Thanks a lot for pointing out all of the issues :) – Matthias Bö Jun 15 '18 at 15:58
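
For illustration only, here is a minimal sketch of the non-regex direction the comments point to: a small state machine in awk, called from the shell, that tracks whether it is inside quotes instead of relying on lookaheads. The function name `split_csv` is hypothetical, and the sketch assumes the escaping style shown in the question (`\"` for a quote, `\\` for a backslash) and that no field contains an embedded newline. It is not a replacement for a full CSV parser.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): print every field of each
# input line on its own line.
# Assumptions: quotes inside fields are escaped as \" and backslashes as \\,
# and no field contains an embedded newline.
split_csv() {
    awk '
    {
        field = ""
        in_quotes = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\\" && i < n) {
                # Backslash escape: take the next character literally.
                i++
                field = field substr($0, i, 1)
            } else if (c == "\"") {
                # Unescaped quote: toggle the quoted state and drop the quote.
                in_quotes = !in_quotes
            } else if (c == "," && !in_quotes) {
                # A comma outside quotes ends the current field.
                print field
                field = ""
            } else {
                field = field c
            }
        }
        # Emit the last field of the line.
        print field
    }' "$@"
}
```

It could be called as `split_csv "$1"` in place of the `while read` loop, or fed a single line on standard input. Because every field is printed on its own line, record boundaries are not preserved in the output; a real pipeline would likely want a separator between records.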

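The quoting problem with `echo $line` raised in the comments can be reproduced with a short sketch. The temporary directory and the file names `a.txt` and `b.txt` are assumptions made purely for the demonstration.

```shell
#!/bin/sh
# Work in a throwaway directory so the glob has known files to match.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a.txt b.txt

line='"My * AWESOME * company"'

# Unquoted: the shell word-splits $line and expands each * against the
# files in the current directory.
unquoted=$(echo $line)

# Quoted: the value is passed through unchanged.
quoted=$(printf '%s\n' "$line")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

With only those two files present, the unquoted form replaces each `*` with `a.txt b.txt`, while the quoted form keeps the field intact. `printf '%s\n'` is used here instead of `echo "$line"` only because `echo` may treat backslashes differently between shells.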