
I'm working with a hand-filled file and I'm having trouble parsing it. My input file cannot be altered, and my code has to stay a bash script.

I made a simple example to make it easy for you ^^

var="hey","i'm","happy, like","you"
IFS="," read -r one two tree for five <<<"$var"
echo $one:$two:$tree:$for:$five

Now, I think you've already seen the problem here. I would like to get

hey:i'm:happy, like:you:

but I get

hey:i'm:happy: like:you

I need a way to tell read that the double quotes take precedence over the IFS. I have read about the eval command, but I can't take that risk.

Finally, this is a directory file and the troublesome field is the description one, so it could contain basically anything.

The original file looks like this:

"type","cn","uid","gid","gecos","description","timestamp","disabled"
"type","cn","uid","gid","gecos","description","timestamp","disabled"
"type","cn","uid","gid","gecos","description","timestamp","disabled"

Edit #1

I will give a better example; the one I used above is too simple, and @StefanHegny found it causes another error.

while read -r ldapLine
do
    IFS=',' read -r objectClass dumy1 uidNumber gidNumber username description modifyTimestamp nsAccountLock gecos homeDirectory loginShell createTimestamp dumy2 <<<"$ldapLine"

    isANetuser=0

    while IFS=":" read -r -a class
    do
        for i in "${class[@]}"
        do
            if [ "$i" == "account" ]
            then
                isANetuser=1
                break
            fi
        done
    done <<<"$objectClass"

    if [ "$isANetuser" == 0 ]
    then
        continue
    fi

    #MORE STUFF APPEND#

done < file.csv

So this is a small part of the code, but it should explain what I do. The file.csv contains many lines like this:

"top:shadowAccount:account:posixAccount","Jdupon","12345","6789","Jdupon","Jean Mark, Dupon","20140511083750Z","","Jean Mark, Dupon","/home/user/Jdupon","/bin/ksh","20120512083750Z","",""
Jonathan Leffler
Guillaumedk
    Could be related to http://stackoverflow.com/questions/8940352/how-to-handle-commas-within-a-csv-file-being-read-by-bash-script, it just might help – Inian May 25 '16 at 09:37
    Probably best not to try text processing with bash, no idea why you wouldn't be able to use something actually meant for text processing. Trying to parse csv data with bash is ridiculous. – 123 May 25 '16 at 09:48
    The $var as written will not contain any `""` any more, the prob is not on the IFS side but that $var looks `"hey,i'm,happy, like,you"` and so this presumed "context" is lost. – Stefan Hegny May 25 '16 at 09:56
  • @StefanHegny in real life i use 'while read -r var; do #work; done < /temp/file' does that change something ? – Guillaumedk May 25 '16 at 11:38
  • @123 i know that, but I've been push into a project by my team, and i have to develop an audit tool all in bash. thrust me it's boring enough, so i don't need to have someone reminding me that ... <3 – Guillaumedk May 25 '16 at 11:51
  • @Guillaumedk, sure, when you read the `"hey","i'm","happy, like","you"` that way as var, then the string will contain the quotes in the variable (i.e. e.g. if this line was in your /temp/file). Then we end up with your maybe original problem. Anyway, I suggest you check the link given in Inians comment about the csv handling about an idea how to handle this - using the , as IFS is not the way to go... – Stefan Hegny May 25 '16 at 12:07
  • @StefanHegny sadly i only have GNU Awk 3.1.7 at my disposal. – Guillaumedk May 25 '16 at 12:18
  • @Guillaumedk So it doesn't have to be in bash ? – 123 May 25 '16 at 12:43
  • @123 sadly it must be in bash. it's a stupid 3 month project in witch i should analyses a fix thing with result from a tool, but the tool is broken and i have to make a new one in bash. – Guillaumedk May 25 '16 at 13:14
  • @Guillaumedk But you just said you can use awk ? – 123 May 25 '16 at 13:23
  • @123 yea, as i can use sed, in my bash script, did i forgot to say it was a script ? Damn .... sorry ^^ what i did mean it that it have to be all in one bash script, so basic tool could work. The script will have to work on a bunch version of aix, redhat and solaris. – Guillaumedk May 25 '16 at 13:47
  • Do they all have perl? Super easy in perl. – 123 May 25 '16 at 13:50
  • @123,i guess they have, but i can't have 2 separate file. if i can write Perl in a bash file, it could work ^^ – Guillaumedk May 25 '16 at 13:58
  • `awk` is an entirely different scripting language (and an extremely powerful one for text-processing). Embedding awk in bash is like embedding perl in bash, or python in bash, which is why saying you need a pure-bash answer will lead many folks to assume you can't use awk. (Scripts with a `#!/usr/bin/awk -f` shebang are by no means unheard of). – Charles Duffy May 25 '16 at 15:23
  • If you'd searched for `csv` when trying to find duplicates, btw, you probably would have found the preexisting question (and answers) yourself. – Charles Duffy May 25 '16 at 15:29

2 Answers


If the various bash versions you will use are all more recent than v3.0, when regexes and BASH_REMATCH were introduced, you could use something like the following function: [Note 1]

each_field () {
    local v=,$1;
    while [[ $v =~ ^,(([^\",]*)|\"[^\"]*\") ]]; do
        printf "%s\n" "${BASH_REMATCH[2]:-${BASH_REMATCH[1]:1:-1}}";
        v=${v:${#BASH_REMATCH[0]}};
    done
}

Its argument is a single line (remember to quote it!) and it prints each comma-separated field on a separate line. As written, it assumes that no field has an embedded newline; that's legal in CSV, but it makes dividing the file into lines a lot more complicated. If you actually needed to deal with that scenario, you could change the \n in the printf statement to a \0 and then use something like xargs -0 to process the output. (Or you could insert whatever processing you need to do to the field in place of the printf statement.)
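For instance, the NUL-delimited variant could be used to collect one line's fields into a bash array. This is only a sketch: the `each_field0` name and the sample line are mine, the parser body is the function above with `\n` swapped for `\0`, and the negative substring offset it inherits needs bash 4.2 or later.

```shell
# Hypothetical NUL-delimited variant of the parser above (assumes bash 4.2+).
each_field0 () {
    local v=,$1
    while [[ $v =~ ^,(([^\",]*)|\"[^\"]*\") ]]; do
        # Print each field NUL-terminated instead of newline-terminated.
        printf '%s\0' "${BASH_REMATCH[2]:-${BASH_REMATCH[1]:1:-1}}"
        v=${v:${#BASH_REMATCH[0]}}
    done
}

# Collect the fields into an array, splitting on the NUL delimiter.
fields=()
while IFS= read -r -d '' f; do
    fields+=("$f")
done < <(each_field0 '"hey","i'\''m","happy, like","you"')
```

Here `fields[2]` holds `happy, like` with its comma intact, which is exactly the behavior the question asks for.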

It goes to some trouble to dequote quoted fields without modifying unquoted fields. However, it will fail on fields with embedded double quotes. That's fixable, if necessary. [Note 2]

Here's a sample, in case that wasn't obvious:

while IFS= read -r line; do
  each_field "$line"
  printf "%s\n" "-----"
done <<EOF
type,cn,uid,gid,gecos,"description",timestamp,disabled
"top:shadowAccount:account:posixAccount","Jdupon","12345","6789","Jdupon","Jean Mark, Dupon","20140511083750Z","","Jean Mark, Dupon","/home/user/Jdupon","/bin/ksh","20120512083750Z","",""

EOF

Output:

type
cn
uid
gid
gecos
description
timestamp
disabled
-----
top:shadowAccount:account:posixAccount
Jdupon
12345
6789
Jdupon
Jean Mark, Dupon
20140511083750Z

Jean Mark, Dupon
/home/user/Jdupon
/bin/ksh
20120512083750Z


-----

Notes:

  1. I'm not saying you should use this function. You should use a CSV parser, or a language which includes a good CSV parsing library, like python. But I believe this bash function will work, albeit slowly, on correctly-formatted CSV files of a certain common CSV dialect.

  2. Here's a version which handles doubled quotes inside quoted fields, which is the classic CSV syntax for interior quotes:

    each_field () { 
        local v=,$1;
        while [[ $v =~ ^,(([^\",]*)|\"(([^\"]|\"\")*)\") ]]; do
            echo "${BASH_REMATCH[2]:-${BASH_REMATCH[3]//\"\"/\"}}";
            v=${v:${#BASH_REMATCH[0]}};
        done
    }
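As a quick sanity check of the doubled-quote handling (the sample line is my own, and the function is repeated so the snippet stands alone):

```shell
# The note-2 parser, exercised on a field containing doubled interior quotes.
each_field () {
    local v=,$1
    while [[ $v =~ ^,(([^\",]*)|\"(([^\"]|\"\")*)\") ]]; do
        echo "${BASH_REMATCH[2]:-${BASH_REMATCH[3]//\"\"/\"}}"
        v=${v:${#BASH_REMATCH[0]}}
    done
}

each_field '"He said ""hi""",plain,""'
```

This prints `He said "hi"` (the `""` collapsed to a single quote), then `plain`, then an empty line for the empty quoted field.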
    
rici
  • Hello again, i am not sure to understand the this ouput of it. I was thinking of a mistake and that you could mean `printf "%s\n" "-----"`. but i still don't understand how can i save each value, and what `v=${v:${#BASH_REMATCH[0]}};` is for. – Guillaumedk May 26 '16 at 12:34
  • Yes that should be `%s`, sorry. Fixed. `v=${v:${#BASH_REMATCH[0]}};` discards the field which was just printed. – rici May 26 '16 at 12:42
  • OK so you print each field one by one and discard it every time. I actually try to implement it without having to rewrite my 400l ... a bash script should never exceed 200, and i'm far from over – Guillaumedk May 26 '16 at 12:56
  • That work very well, except that I'm not sure i have understand it all. Thank you for the time and effort ! – Guillaumedk May 27 '16 at 07:52
  • i have some trouble to save this to an array. i have try a lot of thing, but the closer i could get to a working code was `array+=(${BASH_REMATCH[2]:-${BASH_REMATCH[3]//\"\"/\"}})` but when i have space in a field it was spiting. i try to add `" ${BASH_RE..... "` but it was outputting all field in one line. – Guillaumedk May 27 '16 at 15:51
    @Guillaumedk: You need to quote the expression inside the array assignment: `array+=("${BASH_REMATCH[2]:-${BASH_REMATCH[3]//\"\"/\"}}")` – rici May 27 '16 at 16:25

My suggestion, as in some previous answers (see below), is to switch the separator to | (and use IFS="|" instead):

sed -r 's/,([^,"]*|"[^"]*")/|\1/g'

This requires a sed that supports extended regular expressions (-r), however.
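Applied to a line like the question's data, it might look like this (a sketch only: the variable names are mine, the quotes stay attached to each field after splitting, and it assumes the first field contains no embedded comma):

```shell
# Turn the field-separating commas into | with sed -r, then split on |.
line='"Jdupon","Jean Mark, Dupon","20140511083750Z"'
converted=$(printf '%s\n' "$line" | sed -r 's/,([^,"]*|"[^"]*")/|\1/g')
IFS='|' read -r username description timestamp <<<"$converted"

# The surrounding quotes survive the split; strip them where needed.
description=${description%\"}
description=${description#\"}
echo "$description"
```

The embedded comma in the description survives because sed consumes each quoted field as a whole, so only the commas between fields are rewritten.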

Should I use AWK or SED to remove commas between quotation marks from a CSV file? (BASH)

Is it possible to write a regular expression that matches a particular pattern and then does a replace with a part of the pattern

Jeff Y
  • i will need to try with something like a #, because the one who fill the directory had use | .... i hope they will not use #, but i'm still looking for a foolproof answer ^^ – Guillaumedk May 25 '16 at 13:55
    The foolproof answer is to use a CSV processing tool to process CSV data. The shell is not designed for it and CSV formatting with commas inside double quoted strings makes it extremely difficult to do. – Jonathan Leffler May 25 '16 at 15:23