Parse fields within fields with awk

Question

Given a delimiter-separated file, for example things.file, containing:

universe {planets;stars;people}, planet {countries; restaurants}, sky {clouds; planes}
table {dishes}, chair {butts; more butts}, face {eyes; mouths}
computers {memories; processors}, screens {good images; bad images; ugly images}, dogs {tails; fun }

I find myself doing

$ awk -F"," '{print $2}' things.file | awk -F"{|}" '{print $2}' | awk -F";" '{print $1}'
countries
butts
good images

to get the fields within fields. Is there a cleaner way to do it, i.e., without calling awk thrice?

Can there be `{`, `}`, or `"` or newlines within a pair of `{...}`s or anywhere else between `,`s? — Ed Morton, Sep 20 '19 at 14:14
@EdMorton yes, this was a nice example, but in my case there can indeed be `{`, `}` and `"` anywhere else as well. — myradio, Sep 25 '19 at 20:56
I downvoted because after several days you still hadn't provided all the necessary information, weren't responding to questions, and weren't providing feedback on any of the answers. — Ed Morton, Sep 26 '19 at 14:24
Ok, fair enough. I was testing the answers as much as I could. My files sometimes have small differences; as I commented on @inian's answer, sometimes the `;` and `,` are exchanged. — myradio, Sep 27 '19 at 07:13

Inian · Accepted Answer · 2019-09-20T08:23:13.633

2

You can do this with a single awk call and one call to the split() function, as below.

awk -v FS=, '{ split($2, arr, /[{};]/);  print arr[2] }' file

The split() function splits $2 on the regex given as its last argument, [{};], i.e. wherever any of those characters appears. The resulting pieces are stored in the array arr, from which you can retrieve the ones you want.

If leading and trailing spaces need to be removed, trim the array element after the call to split() and before the print. Use gsub() rather than sub() here, because sub() replaces only the first match and would leave the trailing spaces in place:

gsub(/^[[:space:]]+|[[:space:]]+$/, "", arr[2])
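Putting the two pieces together, here is a minimal sketch of the complete command run against the sample data from the question. The names `sample` and `first_sub` are made up for this illustration, and the trim uses gsub() on arr[2] (the element that is printed) so that both ends are cleaned.

```shell
# Sample data from the question, held in a variable for the demo.
sample='universe {planets;stars;people}, planet {countries; restaurants}, sky {clouds; planes}
table {dishes}, chair {butts; more butts}, face {eyes; mouths}
computers {memories; processors}, screens {good images; bad images; ugly images}, dogs {tails; fun }'

# first_sub (hypothetical helper): print the first subfield of the 2nd comma-separated field.
first_sub() {
    awk -F, '{
        split($2, arr, /[{};]/)    # arr[1] = text before "{", arr[2] = first subfield
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", arr[2])    # trim both ends
        print arr[2]
    }'
}

printf '%s\n' "$sample" | first_sub
```

On the sample this should print countries, butts and good images, one per line, which matches the output of the three-awk pipeline in the question.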

edited Sep 20 '19 at 08:23

answered Sep 20 '19 at 08:14


Looks nice but it is a bit slow. – myradio Sep 26 '19 at 05:54
@myradio the scripts in this answer may or may not work for you but they will **NOT** be slow. – Ed Morton Sep 26 '19 at 14:35
@EdMorton indeed this is super fast actually, and gives exactly what I need. I was not seeing that I was getting an error because I was missing the `"` in the field separator (In some files I have `;` and `,` exchanged). This does the trick: `awk -v FS=";" '{ split($2, arr, /[{},]/); for (i in arr) {printf("%s ", arr[i])}; printf("\n") }' file.dsv` – myradio Sep 27 '19 at 07:10

RavinderSingh13 · Answer 2 · 2019-09-27T07:20:44.947

1

EDIT: To print all the subfields of the 2nd field, one per line (assuming they are separated by "; "):

awk -F, '{gsub(/.*{|}/,"",$2);gsub(/; /,ORS,$2);print $2}'  Input_file

Could you please try the following? This can be done in a single awk.

awk -F,  'match($2,/{[^;]*/){print substr($2,RSTART+1,RLENGTH-1)}' Input_file

Explanation: The field separator is set to a comma for all lines of Input_file. awk's built-in match() function is applied to the 2nd field with a regex that matches from { up to, but not including, the first occurrence of ;. When the regex matches, match() sets the variables RSTART (where the match begins) and RLENGTH (how long it is). substr() then prints RLENGTH-1 characters starting at RSTART+1, i.e. the matched text without its leading {.
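The same match()/RSTART/RLENGTH idea can be stretched to return every subfield rather than only the first. This is only a sketch under the assumption that each field holds exactly one {...} pair with no nested braces; `sample` and `all_subs` are names invented for the illustration. The braces are written as [{] and [}] because a bare { at the start of a regex is not treated the same way by every awk.

```shell
# Sample data from the question, held in a variable for the demo.
sample='universe {planets;stars;people}, planet {countries; restaurants}, sky {clouds; planes}
table {dishes}, chair {butts; more butts}, face {eyes; mouths}
computers {memories; processors}, screens {good images; bad images; ugly images}, dogs {tails; fun }'

# all_subs (hypothetical helper): print every subfield of the 2nd field, one per line.
all_subs() {
    awk -F, 'match($2, /[{][^}]*[}]/) {
        # RSTART points at "{", RLENGTH covers "{...}"; drop both braces.
        inner = substr($2, RSTART + 1, RLENGTH - 2)
        n = split(inner, parts, /[[:space:]]*;[[:space:]]*/)
        for (i = 1; i <= n; i++) {
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", parts[i])    # trim stray blanks
            print parts[i]
        }
    }'
}

printf '%s\n' "$sample" | all_subs
```

Lines whose 2nd field has no {...} pair are skipped, because the match() call is used as the pattern that guards the action.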

edited Sep 27 '19 at 07:20

answered Sep 20 '19 at 08:29


This gives me only the first subfield, as you said whatever is between `{` and `;` – myradio Sep 27 '19 at 07:03
@myradio, I meant from `{` to till first occurrence of `;` it will provide it. – RavinderSingh13 Sep 27 '19 at 07:06
That is correct but how can I use this to get all the subfields between the `;` between the `{`, `}`? – myradio Sep 27 '19 at 07:14
@myradio, Glad that it helped you, you could select this as a correct answer too :) – RavinderSingh13 Sep 27 '19 at 09:51

Ed Morton · Answer 3 · 2019-09-20T14:41:31.000

Making a bunch of assumptions about what characters your fields can contain (i.e. that they always look just like your example), here's how to parse the data such that you can do anything you like with it in future:

$ cat tst.awk
BEGIN { FS="[[:space:]]*,[[:space:]]*"; OFS="\t" }
{
    for (i=1; i<=NF; i++) {
        head = tail = $i
        sub(/[[:space:]]*{.*/,"",head)
        gsub(/.*{[[:space:]]*|[[:space:]]*}[[:space:]]*$/,"",tail)

        n = split(tail,subFlds,/[[:space:]]*;[[:space:]]*/)

        print "field:", $i
        print "head:", head
        print "tail:", tail
        for (j=1; j<=n; j++) {
            print "sub " j ":", subFlds[j]
        }
        print "\n------------\n"
    }
    print "############\n"
}


$ awk -f tst.awk file
field:  universe {planets;stars;people}
head:   universe
tail:   planets;stars;people
sub 1:  planets
sub 2:  stars
sub 3:  people

------------

field:  planet {countries; restaurants}
head:   planet
tail:   countries; restaurants
sub 1:  countries
sub 2:  restaurants

------------

field:  sky {clouds; planes}
head:   sky
tail:   clouds; planes
sub 1:  clouds
sub 2:  planes

------------

############

field:  table {dishes}
head:   table
tail:   dishes
sub 1:  dishes

------------

field:  chair {butts; more butts}
head:   chair
tail:   butts; more butts
sub 1:  butts
sub 2:  more butts

------------

field:  face {eyes; mouths}
head:   face
tail:   eyes; mouths
sub 1:  eyes
sub 2:  mouths

------------

############

field:  computers {memories; processors}
head:   computers
tail:   memories; processors
sub 1:  memories
sub 2:  processors

------------

field:  screens {good images; bad images; ugly images}
head:   screens
tail:   good images; bad images; ugly images
sub 1:  good images
sub 2:  bad images
sub 3:  ugly images

------------

field:  dogs {tails; fun }
head:   dogs
tail:   tails; fun
sub 1:  tails
sub 2:  fun

------------

############
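Once the data is parsed this way, the original question reduces to picking one sub-field out of one field. The following is a sketch only, under the same assumptions about the input as the script above; `sample` and `pick` are hypothetical names, and the braces are written as [{] and [}] for portability across awk versions.

```shell
# Sample data from the question, held in a variable for the demo.
sample='universe {planets;stars;people}, planet {countries; restaurants}, sky {clouds; planes}
table {dishes}, chair {butts; more butts}, face {eyes; mouths}
computers {memories; processors}, screens {good images; bad images; ugly images}, dogs {tails; fun }'

# pick (hypothetical helper): usage is  pick FIELD SUBFIELD < input
pick() {
    awk -v f="$1" -v s="$2" '
    BEGIN { FS = "[[:space:]]*,[[:space:]]*" }
    {
        tail = $f
        # Strip everything up to and including "{", and the closing "}".
        gsub(/.*[{][[:space:]]*|[[:space:]]*[}][[:space:]]*$/, "", tail)
        n = split(tail, subFlds, /[[:space:]]*;[[:space:]]*/)
        print (s <= n ? subFlds[s] : "")    # empty line if that sub-field is absent
    }'
}

printf '%s\n' "$sample" | pick 2 1    # first sub-field of field 2
printf '%s\n' "$sample" | pick 3 2    # second sub-field of field 3
```

Passing the field and sub-field numbers with -v keeps the awk program itself unchanged when you want a different column.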

For more robust parsing of CSVs in general with awk (your sample just appears to use {...} where a regular CSV uses "..."), see "What's the most robust way to efficiently parse CSV using awk?"
