awk output with regex filter seems to skip indices

Question

I have a line like this in a file:

abc-content: ["afox","dfox","xfox","ufox","sdao","qusa","hero"]

And after playing around with awk, this seems to work fine to pull out all the strings from the line:

awk -F'[ :\[,"]' '{print $1, $5, $8, $11, $14, $17, $20, $23}' < file

If I use $2 or $3 or $4 instead of $5 (for example), I get a blank output. Can someone explain what could be happening here? Are my filters also occupying the intermediate indices?

Ed Morton · Accepted Answer · 2021-07-02T14:51:20.513

Given that FS setting, every single character in the bracket expression is a separator on its own, so the null strings between adjacent separators (e.g. between every " and ,) are fields too.

I think you meant -F': [[]"|","|"]' or similar:

$ awk -F': [[]"|","|"]' '{for (i=1; i<=NF; i++) print i, "<" $i ">"}' file
1 <abc-content>
2 <afox>
3 <dfox>
4 <xfox>
5 <ufox>
6 <sdao>
7 <qusa>
8 <hero>
9 <>

or if you prefer not to have that null field after the last FS component ("]) then don't include it in the FS and just remove "] from the end of the record:

$ awk -F': [[]"|","' '{sub(/"]$/,""); for (i=1; i<=NF; i++) print i, "<" $i ">"}' file
1 <abc-content>
2 <afox>
3 <dfox>
4 <xfox>
5 <ufox>
6 <sdao>
7 <qusa>
8 <hero>

-F'(: [[]|",)"' would be another way to write that FS if you prefer.
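As a quick check that the two FS spellings split the record identically, here is a minimal sketch. It assumes the sample line from the question is held in a shell variable; the names line, prog, out_a and out_b are illustrative only.

```shell
# Sample record copied from the question.
line='abc-content: ["afox","dfox","xfox","ufox","sdao","qusa","hero"]'

# Same awk program as above: strip the trailing "] and list every field.
prog='{sub(/"]$/,""); for (i=1; i<=NF; i++) print i, "<" $i ">"}'

# Run it once with each FS spelling.
out_a=$(printf '%s\n' "$line" | awk -F': [[]"|","' "$prog")
out_b=$(printf '%s\n' "$line" | awk -F'(: [[]|",)"' "$prog")

# Report whether both spellings produced identical field lists.
if [ "$out_a" = "$out_b" ]; then
    echo "same fields"
else
    echo "fields differ"
fi
```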

Here's how your FS setting was splitting the record into fields:

$ awk -F'[ :[,"]' '{for (i=1; i<=NF; i++) print i, "<" $i ">"}' file
1 <abc-content>
2 <>
3 <>
4 <>
5 <afox>
6 <>
7 <>
8 <dfox>
9 <>
10 <>
11 <xfox>
12 <>
13 <>
14 <ufox>
15 <>
16 <>
17 <sdao>
18 <>
19 <>
20 <qusa>
21 <>
22 <>
23 <hero>
24 <]>

By the way, escaping [ to include it in a bracket expression isn't necessary: like most characters, [ is already literal inside a bracket expression. For those that aren't (], ^, and -), escaping them with a \ is the wrong thing to do if you want to be POSIX compliant; instead you need to position them appropriately within the bracket expression. See https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05.
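A minimal sketch of that positioning rule: ] goes first, ^ goes anywhere except first, and - goes last, so all three can be literal in a single bracket expression without any backslashes. The record a]b^c-d is made up for illustration and is not from the question, and out is just an illustrative variable name.

```shell
# Made-up record that uses ], ^ and - as separators.
# In the FS, ] is first, ^ is not first, and - is last, so each one is literal.
out=$(printf '%s\n' 'a]b^c-d' |
    awk -F'[]^-]' '{for (i=1; i<=NF; i++) print i, "<" $i ">"}')

printf '%s\n' "$out"
```

With an awk that follows the POSIX bracket expression rules, this should split the record into the four fields a, b, c and d.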

The above would still fail given some [unlikely] combinations of chars within your quoted fields (e.g. if you had a field like "foo""," or "foo: [") because it potentially has all of the usual issues associated with CSV format. See "What's the most robust way to efficiently parse CSV using awk?" if that's a problem for your real data.
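If separator-like text such as , or : [ can appear inside the quoted strings, one alternative (a different technique from the FS approach above, and not taken from the linked question) is to skip field splitting and pull out each quoted string with match(). This is only a sketch: the sample line is made up, the names line, rec and quoted are illustrative, it does not print the abc-content key, and it still does not handle escaped quotes such as "" inside a string.

```shell
# Made-up record whose quoted strings contain separator-like text.
line='abc-content: ["afox","d,fox","x: [fox"]'

quoted=$(printf '%s\n' "$line" | awk '{
    rec = $0
    # Repeatedly find the next "..." run and print it without its quotes.
    while (match(rec, /"[^"]*"/)) {
        print substr(rec, RSTART + 1, RLENGTH - 2)
        # Continue scanning after the string just extracted.
        rec = substr(rec, RSTART + RLENGTH)
    }
}')

printf '%s\n' "$quoted"
```

This uses only match(), RSTART, RLENGTH and substr(), which are all part of POSIX awk.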
