I need to split a string into an array in bash, not by a fixed delimiter character but by a regular expression. Here are the details:
I have a CSV file containing millions of lines of the following format:
123,"usrlogin1","usrname2","companyname1","email@example.com","1970-01-01 00:00:00","USR",0,1
However, some fields may contain escaped quotation marks (\") and commas within strings, so a line might just as well look like this:
123,"usrlogin1","usrname2","The \"awesome\" company, NA","email@example.com","1970-01-01 00:00:00","USR",0,1
I need to split this file by the correct commas, that is to say those which are outside of non-escaped quotation marks.
Desired output:
123
usrlogin1
usrname2
The "awesome" company, NA
email@example.com
1970-01-01 00:00:00
USR
0
1
This is where I am at:
#!/bin/bash
#read file line by line
while IFS='' read -r line; do
#replace escaped quotation marks (\") with the entity (&quot;)
line=$(printf '%s\n' "$line" | sed 's/\\"/\&quot;/g')
#split this line now by commas
#TODO
done < "$1"
At the point of TODO, I would like to use the following regex as the delimiter for splitting my string into an array:
(?!\B"[^"]*),(?![^"]*"\B)
Demo: https://regex101.com/r/xB7rQ7/1
How can I split $line by the regex above?
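One caveat worth noting: bash's [[ ... =~ ... ]] operator uses POSIX extended regular expressions, which have no lookahead, so the regex above cannot be used there as written. As a rough sketch of a possible alternative (not a definitive solution), the line can be consumed one field at a time instead of being split on a delimiter. The function name split_csv_line and the global array name fields are hypothetical, chosen only for illustration, and the sketch assumes it receives the raw line, before any replacement of escaped quotes.

```bash
#!/bin/bash
# Sketch only: split one CSV line into the global array "fields".
# split_csv_line and fields are hypothetical names, not part of the script above.
# Assumes a field is either fully quoted (may contain commas and \") or
# unquoted (contains neither commas nor quotes).
split_csv_line() {
    local rest=$1 field
    # Group 2: content of a quoted field, group 4: an unquoted field,
    # group 5: the separator (a comma, or empty at end of line).
    local re='^("(([^"\\]|\\.)*)"|([^",]*))(,|$)'
    local escaped_quote='\"' quote='"'
    fields=()
    while [[ $rest =~ $re ]]; do
        if [[ ${BASH_REMATCH[1]} == '"'* ]]; then
            # Quoted field: take the inner content and turn \" back into "
            field=${BASH_REMATCH[2]}
            field=${field//"$escaped_quote"/$quote}
        else
            field=${BASH_REMATCH[4]}
        fi
        fields+=("$field")
        # No separator matched means the end of the line was reached
        [[ -z ${BASH_REMATCH[5]} ]] && return 0
        # Drop the consumed field and its comma, then continue
        rest=${rest:${#BASH_REMATCH[0]}}
    done
    return 1    # malformed line, e.g. an unterminated quote
}

line='123,"usrlogin1","usrname2","The \"awesome\" company, NA","email@example.com","1970-01-01 00:00:00","USR",0,1'
if split_csv_line "$line"; then
    printf '%s\n' "${fields[@]}"    # one field per line
fi
```

Inside the while read loop, the call would presumably replace the TODO line. With millions of lines, a pure bash loop like this is likely to be slow, so a dedicated CSV parser or awk may be the more practical choice.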