Ignore comma after backslash in a line in a text file using awk or sed

Question

I have a text file containing several lines of the following format:

name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school

I need to parse the text file and print the fields that contain escaped commas (\,), without treating those commas as field separators. In the examples above those are fields 2 and 3, printed like this:

science, social
tennis, ping_pong, chess

I do not know how to ignore escaped characters. How can I do it with awk or sed in a terminal?

oguz ismail · Accepted Answer · 2019-04-01T20:23:40.183

3

Substitute \, with a character that your records do not contain normally (e.g. \n), and restore it before printing. For example:

$ awk -F',' 'NR>1{ if(gsub(/\\,/,"\n")) gsub(/\n/,",",$2); print $2 }' file
science,social
painting

Since the first gsub is performed on the whole record (i.e. $0), awk is forced to recompute the fields. The second one is performed on the second field only (i.e. $2), so it does not affect the other fields. See: Changing Fields.
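To watch the recomputation happen, here is a minimal sketch that prints NF before and after the record-wide gsub, followed by the restored second field. The show_resplit helper and the inline sample record are only there for illustration; they are not part of the original file or answer.

```shell
# Hypothetical sample record, same shape as the data in the question.
record='john,science\,social,football,florence_school'

# Print NF before and after the record-wide gsub, then the restored $2.
show_resplit() {
    printf '%s\n' "$1" | awk -F',' '{
        before = NF              # naive split: the escaped comma counts as a separator
        gsub(/\\,/, "\n")        # modifies $0, so awk splits the record again
        after = NF
        gsub(/\n/, ",", $2)      # touches $2 only; the other fields keep their content
        print before, after, $2
    }'
}

show_resplit "$record"
```

For the sample record this should print `5 4 science,social`: five fields while the escaped comma is still treated as a separator, four once it has been hidden behind the placeholder.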

To be able to extract multiple fields with properly escaped commas, you need to replace the \n characters in every field with a for loop, as in the following example:

$ awk 'BEGIN{ FS=OFS="," } NR>1{ if(gsub(/\\,/,"\n")) for(i=1;i<=NF;++i) gsub(/\n/,"\\,",$i); print $2,$3 }' file
science\,social,football
painting,tennis\,ping_pong\,chess
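If the goal is the exact output from the question (only the list fields, with ", " between the items), the same placeholder idea can be adapted. This is a sketch under the assumption that only fields containing an escaped comma should be printed; print_lists is a made-up helper name.

```shell
# Hypothetical helper: print every field that held escaped commas,
# with the items separated by ", ". Reads the named files, or stdin.
print_lists() {
    awk -F',' 'NR > 1 && gsub(/\\,/, "\n") {   # skip the header and lines without \,
        for (i = 1; i <= NF; i++)
            if ($i ~ /\n/) {                   # this field contained escaped commas
                gsub(/\n/, ", ", $i)
                print $i
            }
    }' "$@"
}

print_lists <<'EOF'
name,list_of_subjects,list_of_sports,school
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
```

Because lines without any \, never reach the loop, records whose lists are empty or have a single item produce no output at all instead of an empty line.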

See also: What's the most robust way to efficiently parse CSV using awk?.

edited Apr 01 '19 at 20:23

answered Apr 01 '19 at 13:09

oguz ismail


How to not print empty lines if the list is empty in those columns? Right now this solution prints an empty line if the list is empty. – Lax_Sam Apr 01 '19 at 13:27
**Input** `Line1: john,science\,social,football,florence_school Line2: james,painting,tennis\,ping_pong\,chess,highmount_school Line3: robert,,snooker,ridgemont Line4: jim,geography,,oakmont` – Lax_Sam Apr 01 '19 at 13:31
Isn't the second line of output (third line of input) supposed to be `tennis, ping_pong, chess`? – potong Apr 01 '19 at 15:16
@potong it's not an example output. As you can see 2nd field of 2nd line and 3rd field of 3rd line contain escaped commas and OP says they must be parsed like that – oguz ismail Apr 01 '19 at 15:23
@oguzismail your code logic could fail if two backslashes precede a comma, i.e. if the OP wants to escape the backslash itself. – jxc Apr 01 '19 at 15:47
@jxc why would OP want to escape the escape character? – oguz ismail Apr 01 '19 at 15:52
@oguzismail It is not uncommon in the real life data processing to escape the escape char. Probably not a concern for op, but better to know the potential issues. – jxc Apr 01 '19 at 16:02
@jxc Ok, if it is a concern for OP and he lets me know about that I will update my answer. – oguz ismail Apr 01 '19 at 16:07

score 2 · Answer 2 · answered Apr 01 '19 at 13:07

You could replace the \, sequences with another character that won't appear in your text, split the text around the remaining commas, then replace the chosen character with commas:

sed $'s/\\\\,/\037/g' input | awk -F, '{ printf "Name: %s\nSubjects: %s\nSports: %s\nSchool: %s\n\n", $1, $2, $3, $4 }' | tr $'\037' ','

This uses the ASCII control character "Unit Separator" (octal \037, i.e. 0x1F; the digits in bash's $'\NNN' escapes are read as octal), which I'm pretty sure your input won't contain.
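A self-contained sketch of the same sed | awk | tr pipeline, with the placeholder kept in a variable so it is defined in one place. The US variable and the show_lists helper are illustrative names; the placeholder is written as octal \037, which is the Unit Separator byte 0x1F.

```shell
# Unit Separator written in octal: \037 is the byte 0x1F.
US=$(printf '\037')

# Hypothetical helper: reads records on stdin, prints fields 2 and 3.
show_lists() {
    sed "s/\\\\,/$US/g" |                                   # hide escaped commas
    awk -F, '{ printf "Subjects: %s\nSports: %s\n", $2, $3 }' |
    tr "$US" ','                                            # bring them back as commas
}

printf '%s\n' 'james,painting,tennis\,ping_pong\,chess,highmount_school' | show_lists

# Check which byte the placeholder really is (hex dump).
printf '%s' "$US" | od -An -tx1
```

The od line is a cheap way to confirm the placeholder byte before trusting that it cannot occur in the data.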


score 0 · Answer 3 · answered Apr 01 '19 at 13:16

Why awk and sed when bash with coreutils is just enough:

# Useless use of `cat`, sorry: the here-document stands in for the input file
cat <<EOF |
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
# remove first line!
tail -n+2 |
# substitute `\,` by an unreadable character:
sed 's/\\\,/\xff/g' |
# read the comma separated list
while IFS=, read -r name list_of_subjects list_of_sports school; do
     # read the \xff separated list into an array
     IFS=$'\xff' read -r -d '' -a list_of_subjects < <(printf "%s" "$list_of_subjects")
     # read the \xff separated list into an array
     IFS=$'\xff' read -r -d '' -a list_of_sports < <(printf "%s" "$list_of_sports")

     echo "list_of_subjects : ${list_of_subjects[@]}"
     echo "list_of_sports   : ${list_of_sports[@]}"
done

will output:

list_of_subjects : science social
list_of_sports   : football
list_of_subjects : painting
list_of_sports   : tennis ping_pong chess

Note that this will most probably be slower than a solution using awk.

Note that the principle of operation is the same as in the other answers: substitute the \, string with some other unique character, then use that character to iterate over the elements of the second and third fields.

wrt `Why awk and sed when bash with coreutils is just enough` - because doing it with a bash loop would take more code, be more complicated, harder to write robustly, be less portable, and be far slower than doing it with awk. The guys who invented shell also invented awk for shell to call to manipulate text - they had their reasons... — Ed Morton, Apr 01 '19 at 18:03

ghoti · Answer 4 · 2019-04-01T15:28:25.223

You can perhaps join columns with a function.

function joincol(col,    i) {
    $col=$col FS $(col+1)
    for (i=col+1; i<NF; i++) {
        $i=$(i+1)
    }
    NF--
}

This might get used thusly:

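{
    for (col=1; col<=NF; col++) {
        # Keep joining while the field still ends in a backslash, so that
        # several escaped commas in one list are all merged.
        while (col<NF && $col ~ /\\$/) {
            joincol(col)
        }
    }
}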

Note that decrementing NF is undefined behaviour in POSIX. It may delete the last field, or it may not, and still be POSIX compliant. This works for me in BSDawk and Gawk. YMMV. May contain nuts.
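Put together as something runnable, this could look like the sketch below. Assumptions: FS and OFS are both a comma, the join_escaped wrapper name is made up, and the main block joins repeatedly so that a list with several escaped commas ends up in one field. The backslashes are left in place, as joincol does nothing to remove them.

```shell
# Hypothetical wrapper around the joincol approach; prints fields 2 and 3.
join_escaped() {
    awk '
    function joincol(col,    i) {
        $col = $col FS $(col + 1)        # glue the next field back on
        for (i = col + 1; i < NF; i++)   # shift the remaining fields left
            $i = $(i + 1)
        NF--                             # see the POSIX caveat above
    }
    BEGIN { FS = OFS = "," }
    {
        for (col = 1; col <= NF; col++)
            while (col < NF && $col ~ /\\$/)   # field ends in a backslash
                joincol(col)
        print $2 " | " $3
    }' "$@"
}

join_escaped <<'EOF'
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
```

Printing $2 and $3 explicitly, rather than $0, keeps the result independent of whether a given awk really drops the last field on NF--.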

jxc · Answer 5 · 2019-04-01T15:52:13.383

0

Use gawk's FPAT:

awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print $3}' file
#list_of_sports
#football
#tennis\,ping_pong\,chess

then use gensub to remove the backslashes:

awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print gensub("\\\\", "", "g", $3)}' file
#list_of_sports
#football
#tennis,ping_pong,chess

edited Apr 01 '19 at 15:52

answered Apr 01 '19 at 13:38

jxc


**Warning**: `FPAT` and `gensub` are gawk-specific features. – oguz ismail Apr 01 '19 at 15:24
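For an awk without FPAT or gensub, a rough equivalent is to walk the record one character at a time. The sketch below is an assumption-laden illustration, not the answer's method: field_unescaped is a hypothetical helper, and, like the FPAT pattern, it treats a backslash as escaping whatever character follows it (so an escaped backslash is handled as well).

```shell
# Hypothetical helper: print field number $1 of each stdin record,
# with the escaping backslashes removed.
field_unescaped() {
    awk -v want="$1" '{
        n = 1; cur = ""; len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == "\\" && i < len) {     # escape: keep the next character literally
                i++
                cur = cur substr($0, i, 1)
            } else if (c == ",") {          # unescaped comma ends the current field
                if (n == want) break
                n++; cur = ""
            } else {
                cur = cur c
            }
        }
        print (n == want ? cur : "")        # empty line if the record is too short
    }'
}

printf '%s\n' 'james,painting,tennis\,ping_pong\,chess,highmount_school' | field_unescaped 3
```

For the sample record this should print `tennis,ping_pong,chess`.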

score 0 · Answer 6 · edited Apr 01 '19 at 15:26


Using Perl: change each \, to some control character, say \x01, and then replace it with , again.

$ cat laxman.txt
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
$ perl -ne ' s/\\,/\x01/g and print ' laxman.txt  | perl -F, -lane ' for(@F) { if( /\x01/ ) { s/\x01/,/g ; print } } '
science,social
tennis,ping_pong,chess

edited Apr 01 '19 at 15:26

oguz ismail


answered Apr 01 '19 at 13:43

stack0114106


score 0 · Answer 7 · answered Apr 01 '19 at 15:10

This might work for you (GNU sed):

sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' file

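Replace escaped commas with newlines, then swap newlines and commas, so that every field sits on its own line of the pattern space and the escaped commas are plain commas again. Empty every line that does not contain a comma, remove the newlines, and delete the result if nothing is left.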
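To follow the steps, it can help to stop the script halfway. The sketch below assumes GNU sed (for \n in s and y, and for the M flag); the lists_only wrapper name and the inline sample line are made up for illustration.

```shell
line='james,painting,tennis\,ping_pong\,chess,highmount_school'

# First two commands only: one field per line, escaped commas restored.
printf '%s\n' "$line" | sed -E 's/\\,/\n/g;y/,\n/\n,/'

# Hypothetical wrapper around the full command from the answer.
lists_only() {
    sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' "$@"
}

printf '%s\n' "$line" | lists_only
```

The first command should show the four fields on separate lines, with `tennis,ping_pong,chess` as the only one containing commas; the full command then keeps just that line. Note that when two fields of the same record both contain escaped commas, they end up joined without a separator.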
