I have an XML file, about 18 MB long, containing a series of records for a media list. The file is that large because each record has an embedded image. The application that exported the list wrote out all of the records with no line breaks between them. What I am trying to do is extract just the portion of each XML record that contains the description field. A couple of sample records look like this:
<media>
<version>2</version>
<broadcasts><broadcast><name>KSBW-DT</name><description>KSBW-DT Channel 8</description><genre /><audio_only>False</audio_only><cover_art>THERE IS A MEGABYTE OF IMAGE DATA CONTAINED HERE IN EACH RECORD</cover_art><location>8</location><img>b4/b4098184-bc85-4b17-9a87-43d2c4a23bdd</img></broadcast>
<broadcast><name>Cntrl Coast</name><description>Cntrl Coast Channel 9</description><genre /><audio_only>False</audio_only><cover_art>THERE IS A MEGABYTE OF IMAGE DATA CONTAINED HERE IN EACH RECORD</cover_art><location>9</location><img>44/443f0080-ca33-4150-8873-23359d8999dd</img></broadcast>
...(680 more records)....
</broadcasts></media>
I've been playing with awk for a while now, trying to extract just the description text, but have come up empty. Every pattern I try either returns no match or returns the entire 18 MB string with all the data in it. Here is the latest iteration of my awk script:
#!/bin/bash
set -x
awk '{
    for (i = 1; i <= NF; i++) {
        tmp = match($i, /\<description\>.*\<\/description>/)
        if (tmp) {
            print $i
        }
    }
}' "$1"
But this isn't returning anything. I've only used awk on text with ordinary lines before, not on a continuous 18 MB string. Can awk even parse something like this? I've also tried alternative methods suggested on the web (grep -o, sed), but I can't get any of them to work.
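One direction that might be worth trying, shown here only as a minimal sketch and not a confirmed fix: since there are no useful line breaks, awk can be told to split records on "<" instead of on newlines, so every tag and the text after it becomes its own short record. The function name extract_descriptions is made up for illustration, and the sketch assumes that the description text never contains a literal "<", that the element is always written as <description>text</description> with no attributes, and that XML entities such as &amp; can be left undecoded. It is not a real XML parser.

```shell
#!/bin/sh
# Sketch: print the text of every <description> element in a one-line XML file.
# extract_descriptions is a hypothetical helper name, not part of any tool.
extract_descriptions() {
    # RS = "<" starts a new record at every "<", so each tag plus the text
    # following it is its own record, e.g. "description>KSBW-DT Channel 8".
    awk 'BEGIN { RS = "<" }
         /^description>/ {
             sub(/^description>/, "")   # drop the tag name, keep the text
             print
         }' "$@"
}

# Usage (reads the named file, or standard input if no file is given):
#   extract_descriptions media.xml
```

Because the record separator is a single character, this should not depend on gawk-specific features, and the large cover_art payload simply ends up inside a record that never matches the pattern.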
Appreciate any help or guidance.
Thanks.