Struggle with awk

Question

I have an XML file that is 18 MB long and contains a series of records for a media list. The file is that large because there is an embedded image in each record. The application that exported the list wrote out all of the records with no carriage returns. What I am trying to do is extract just the portion of each XML record that contains the description field for that record. A couple of sample records look like this:

<media>
  <version>2</version>
  <broadcasts><broadcast><name>KSBW-DT</name><description>KSBW-DT Channel 8</description><genre /><audio_only>False</audio_only><cover_art>THERE IS A MEGABYTE OF IMAGE DATA CONTAINED HERE IN EACH RECORD</cover_art><location>8</location><img>b4/b4098184-bc85-4b17-9a87-43d2c4a23bdd</img></broadcast>
  <broadcast><name>Cntrl Coast</name><description>Cntrl Coast Channel 9</description><genre /><audio_only>False</audio_only><cover_art>THERE IS A MEGABYTE OF IMAGE DATA CONTAINED HERE IN EACH RECORD</cover_art><location>9</location><img>44/443f0080-ca33-4150-8873-23359d8999dd</img></broadcast>
...(680 more records)....
</broadcasts></media>

I've been playing with awk for a while now trying to extract just the 'description text' but have come up empty. Any patterns I try either return no match or the entire 18MB string with all the data in it. Here is the latest iteration of my awk script:

#!/bin/bash
set -x
awk '{
        for( i=1; i<=NF; i++) {
                tmp=match($i, /\<description\>.*\<\/description>/)
                if (tmp) {
                        print $i
                }
        }
}' $1

But this isn't returning anything. I've only used awk in the past where the text had typical lines in it, not a continuous 18MB string. Can awk even parse something like this? I've tried alternative methods suggested on the web (grep -o, sed), but I can't seem to get anything to work.

Appreciate any help or guidance.

Thanks.....

Thanks for showing your efforts in your question. Could you please also post a sample of the expected output, for a better understanding of the question? Thank you. — RavinderSingh13, Jan 27 '21 at 04:45
[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). — Cyrus, Jan 27 '21 at 04:54
Another option is `xpath`, a Perl program. See if `xpath -q -e '//description/text()' file` solves your problem; see https://stackoverflow.com/questions/15461737/how-to-execute-xpath-one-liners-from-shell — Sundeep, Jan 27 '21 at 04:57

Richard · Answer 1 · 2021-01-27T06:10:09.800

Define a new Record Separator (RS) that is appropriate for your data. The default in awk is the newline character, which your file does not contain, so awk treats the whole file as a single record.

Consider using the < symbol for this purpose, since it starts every tag in the XML data.

Then use a Field Separator to split each record into fields that can be referred to as $1, $2, and so on. The > symbol is a good choice here: for a record such as description>KSBW-DT Channel 8, $1 is the tag name 'description' and $2 is the value.

Now the awk command should look like this:

# Records are split on "<" and fields on ">", so $1 is the tag name and $2 its text.
awk -v RS="<" -F ">" '
    $1 == "description" {print $2}
' filename.xml
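
To try the record/field separator idea without the 18 MB file, here is a minimal sketch that runs the same approach on a tiny single-line sample modeled on the question's data. The function name extract_descriptions and the shortened sample string are made up for illustration, and the image data is left out.

```shell
#!/bin/sh
# Hypothetical helper: prints the text of every <description> element,
# one per line, reading from the named files or from standard input.
extract_descriptions() {
    # RS='<' makes every tag start a new record; -F'>' splits the tag name
    # ($1) from the text that follows it ($2).
    awk -v RS='<' -F'>' '$1 == "description" { print $2 }' "$@"
}

# Shortened sample on one line, as in the exported file (no newlines).
sample='<media><version>2</version><broadcasts><broadcast><name>KSBW-DT</name><description>KSBW-DT Channel 8</description><genre /></broadcast><broadcast><name>Cntrl Coast</name><description>Cntrl Coast Channel 9</description><genre /></broadcast></broadcasts></media>'

printf '%s' "$sample" | extract_descriptions
```

On the real file this would be called as extract_descriptions filename.xml. Note that this is text splitting, not XML parsing: a description tag with attributes would not match $1 exactly, and entities such as &amp; are printed undecoded, so an XML parser as suggested in the comments is the safer route if the data gets more complicated.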
