-1

Given a delimiter-separated file, for example things.file, containing

universe {planets;stars;people}, planet {countries; restaurants}, sky {clouds; planes}
table {dishes}, chair {butts; more butts}, face {eyes; mouths}
computers {memories; processors}, screens {good images; bad images; ugly images}, dogs {tails; fun }

I find myself doing

$ awk -F"," '{print $2}' things.file | awk -F"{|}" '{print $2}' | awk -F";" '{print $1}'
countries
butts
good images

to get the fields within fields. Is there a cleaner way to do it, i.e., without calling awk thrice?

Inian
myradio
  • Can there be `{`, `}`, or `"` or newlines within a pair of `{...}`s or anywhere else between `,`s? – Ed Morton Sep 20 '19 at 14:14
  • @EdMorton yes, this was a nice example, but in my case there can indeed be `{`, `}` and `"` anywhere else as well. – myradio Sep 25 '19 at 20:56
  • I wonder if whoever downvoted has a comment. – myradio Sep 25 '19 at 20:57
  • I downvoted because after several days you still hadn't provided all the necessary information, weren't responding to questions, and weren't providing feedback on any of the answers. – Ed Morton Sep 26 '19 at 14:24
  • Ok, fair enough. I was testing the answers as much as I could. I sometimes have small changes; as I commented on @Inian's answer, sometimes the `;` and `,` are exchanged. – myradio Sep 27 '19 at 07:13

3 Answers

2

You can do this with a single call to awk and one call to the split() function, as below.

awk -v FS=, '{ split($2, arr, /[{};]/);  print arr[2] }' file

The split() call breaks $2 apart using the regex given as its last argument, [{};], so the text is split wherever any of those characters appears. The pieces are stored in the array arr, from which you can retrieve whichever one you need. Because arr[1] holds the text before the {, the first sub-field is arr[2].

If leading and trailing spaces need to be removed, add a substitution call as below, after the call to split() and before the print. It uses gsub() rather than sub() so that both ends are trimmed; sub() would replace only the first match.

gsub(/^[[:space:]]+|[[:space:]]+$/, "", arr[2])
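Putting the two steps together, here is a minimal sketch of split() followed by trimming. The `subfield` helper and the `sample` variable are hypothetical names introduced for illustration; the data is the sample from the question, and `[ \t]` stands in for `[[:space:]]` only because some older awk builds lack POSIX character classes.

```shell
# Sample data copied from the question.
sample='universe {planets;stars;people}, planet {countries; restaurants}, sky {clouds; planes}
table {dishes}, chair {butts; more butts}, face {eyes; mouths}
computers {memories; processors}, screens {good images; bad images; ugly images}, dogs {tails; fun }'

# subfield N (hypothetical helper): reads stdin and prints the Nth
# sub-field of the 2nd comma-separated field of every line.
subfield() {
    awk -F, -v n="$1" '{
        split($2, arr, /[{};]/)            # arr[1] is the text before "{"
        val = arr[n + 1]                   # so sub-field n lives in arr[n+1]
        gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim both ends
        print val
    }'
}

printf '%s\n' "$sample" | subfield 1
printf '%s\n' "$sample" | subfield 2
```

With the sample data, `subfield 1` should reproduce the output of the three-awk pipeline in the question, and `subfield 2` shows the effect of the trimming, since the second sub-fields all start with a space.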
Inian
  • Looks nice but it is a bit slow. – myradio Sep 26 '19 at 05:54
  • @myradio the scripts in this answer may or may not work for you but they will **NOT** be slow. – Ed Morton Sep 26 '19 at 14:35
  • @EdMorton indeed this is super fast actually, and gives exactly what I need. I was not seeing that I was getting an error because I was missing the `"` in the field separator (in some files I have `;` and `,` exchanged). This does the trick: `awk -v FS=";" '{ split($2, arr, /[{},]/); for (i in arr) {printf("%s ", arr[i])}; printf("\n") }' file.dsv` – myradio Sep 27 '19 at 07:10
1

EDIT:

awk -F, '{gsub(/.*{|}/,"",$2);gsub(/; /,ORS,$2);print $2}'  Input_file


Could you please try the following? This can be done in a single awk.

awk -F,  'match($2,/{[^;]*/){print substr($2,RSTART+1,RLENGTH-1)}' Input_file

Explanation: -F, sets the field separator to a comma for all lines of Input_file. awk's built-in match() function is then applied to the 2nd field with a regex that selects everything from { up to, but not including, the first ;. When the regex matches, match() sets the variables RSTART (the position where the match starts) and RLENGTH (the length of the match). The substr() call starts at RSTART+1 and takes RLENGTH-1 characters, which skips the opening { and prints the rest of the match.
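To make RSTART and RLENGTH concrete, here is a small sketch that prints both values for one field taken from the question's sample. The `show_match` helper is a hypothetical name used only for this illustration, and `[{]` is written as a bracket expression so the brace is unambiguously a literal character.

```shell
# show_match (hypothetical helper): reads stdin and reports where the
# "{...up to the first ;" match starts, how long it is, and the text
# left after skipping the opening brace.
show_match() {
    awk '{
        if (match($0, /[{][^;]*/)) {
            print "RSTART=" RSTART, "RLENGTH=" RLENGTH
            print substr($0, RSTART + 1, RLENGTH - 1)
        }
    }'
}

printf '%s\n' ' chair {butts; more butts}' | show_match
# RSTART=8 RLENGTH=6
# butts
```

Here the match is the six characters `{butts`, which start at position 8. Note that for a field with no `;` at all, such as ` table {dishes}`, `[^;]*` would run through the closing brace; excluding `}` as well, as in `[^;}]*`, is one way to handle that case.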

RavinderSingh13
1

Making a bunch of assumptions about what characters your fields can contain (i.e. that they always look just like your example), here's how to parse the data such that you can do anything you like with it in future:

$ cat tst.awk
BEGIN { FS="[[:space:]]*,[[:space:]]*"; OFS="\t" }
{
    for (i=1; i<=NF; i++) {
        head = tail = $i
        sub(/[[:space:]]*{.*/,"",head)
        gsub(/.*{[[:space:]]*|[[:space:]]*}[[:space:]]*$/,"",tail)

        n = split(tail,subFlds,/[[:space:]]*;[[:space:]]*/)

        print "field:", $i
        print "head:", head
        print "tail:", tail
        for (j=1; j<=n; j++) {
            print "sub " j ":", subFlds[j]
        }
        print "\n------------\n"
    }
    print "############\n"
}


$ awk -f tst.awk file
field:  universe {planets;stars;people}
head:   universe
tail:   planets;stars;people
sub 1:  planets
sub 2:  stars
sub 3:  people

------------

field:  planet {countries; restaurants}
head:   planet
tail:   countries; restaurants
sub 1:  countries
sub 2:  restaurants

------------

field:  sky {clouds; planes}
head:   sky
tail:   clouds; planes
sub 1:  clouds
sub 2:  planes

------------

############

field:  table {dishes}
head:   table
tail:   dishes
sub 1:  dishes

------------

field:  chair {butts; more butts}
head:   chair
tail:   butts; more butts
sub 1:  butts
sub 2:  more butts

------------

field:  face {eyes; mouths}
head:   face
tail:   eyes; mouths
sub 1:  eyes
sub 2:  mouths

------------

############

field:  computers {memories; processors}
head:   computers
tail:   memories; processors
sub 1:  memories
sub 2:  processors

------------

field:  screens {good images; bad images; ugly images}
head:   screens
tail:   good images; bad images; ugly images
sub 1:  good images
sub 2:  bad images
sub 3:  ugly images

------------

field:  dogs {tails; fun }
head:   dogs
tail:   tails; fun
sub 1:  tails
sub 2:  fun

------------

############

For more robust parsing of CSVs in general with awk (your sample just appears to use {...} where a regular CSV uses "..."), see "What's the most robust way to efficiently parse CSV using awk?"
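As an illustration of doing something specific with the parsed data, here is a sketch that reuses the same field-splitting idea to print one chosen sub-field of one chosen field. It rests on the same assumptions about the input as the script above. The `pick` helper and its arguments are hypothetical, and `[ \t]` replaces `[[:space:]]` only for portability to older awk builds.

```shell
# Sample data copied from the question.
sample='universe {planets;stars;people}, planet {countries; restaurants}, sky {clouds; planes}
table {dishes}, chair {butts; more butts}, face {eyes; mouths}
computers {memories; processors}, screens {good images; bad images; ugly images}, dogs {tails; fun }'

# pick FIELD SUB (hypothetical helper): reads stdin and prints sub-field
# SUB of comma-separated field FIELD; lines lacking that sub-field are skipped.
pick() {
    awk -v f="$1" -v s="$2" '
    BEGIN { FS = "[ \t]*,[ \t]*" }          # commas plus surrounding blanks
    {
        tail = $(f + 0)
        # drop everything up to "{" and the closing "}" with its blanks
        gsub(/.*[{][ \t]*|[ \t]*[}][ \t]*$/, "", tail)
        n = split(tail, subFlds, /[ \t]*;[ \t]*/)
        if (s + 0 <= n) print subFlds[s + 0]
    }'
}

printf '%s\n' "$sample" | pick 3 2
```

For example, `pick 3 2` asks for the second sub-field of the third field on every line, and `pick 1 3` prints only for lines whose first field has at least three sub-fields.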

Ed Morton