0

I have an XML feed (this) that is all on a single line, so to extract the data I need I can do something like this:

sed -r 's:<([^>]+)>([^<]+)</\1>:&\n: g' feed | sed -nr '
    /<item>/, $ s:.*<(title|link|description)>([^<]+)</\1>.*:\2: p'

since I can't find a way to make the first sed call process its own result as separate lines.

Any advice?

My goal is to get all the data I need in a single sed call.

neurino
  • well, you can chain expressions. `sed -e 's/foo//' -e 's/bar//'` will first remove foo and then bar. – Mel May 21 '11 at 21:39
  • Even if I add `\n` sed will keep on processing everything as a single line – neurino May 21 '11 at 21:40
  • Obligatory [Don't parse XML with RegEx Link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – johnsyweb May 22 '11 at 00:37
  • @Johnsyweb: you are right, but I just need to download some podcasts and don't need a 100% error-free solution, so regex fits perfectly, and `sed` is available almost everywhere while _more recommended tools_ are not. My bash script, using my answer below, is 35 lines long (comments included) and is now downloading the mp3s I wanted. – neurino May 22 '11 at 01:11

3 Answers

2
sed -rn -e 's|>[[:space:]]*<|>\n<|g
/^<title>/ { bx }
/^<description>/ { b x }
/^<link>/ { bx }
D
:x
s|<([^>]*)>([^\n]*)</\1>|\1=\2|;
P
D' rss.xml

New answer to the new question. Now with branches, and outputting all three chunks of information.
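
To make the branching concrete, here is a minimal sketch run against a made-up one-line feed. The sample data and the `extract_fields` function name are hypothetical, and GNU sed is assumed (for `-r` and for `\n` in the replacement). It follows the same idea as the script above, but writes each branch on its own line and uses `[^<]*` rather than `[^\n]*` for the element text:

```shell
# Hypothetical sample feed on a single line (not the real feed from the question).
feed='<rss><channel><title>Feed</title><item><title>Ep 1</title><link>http://example.com/1.mp3</link><description>First</description></item></channel></rss>'

# Assumes GNU sed. Prints tag=value for every title, link and description.
# Script steps: split elements onto separate lines inside the pattern space,
# branch to label x when the first line is a wanted element, otherwise drop
# the first line with D; at x, rewrite <tag>text</tag> as tag=text, print the
# first line with P, then drop it with D and restart.
extract_fields() {
  printf '%s\n' "$1" | sed -rn 's|>[[:space:]]*<|>\n<|g
/^<title>/bx
/^<link>/bx
/^<description>/bx
D
:x
s|^<([^>]*)>([^<]*)</\1>|\1=\2|
P
D'
}

extract_fields "$feed"
```

On this sample the sketch should print `title=Feed`, `title=Ep 1`, `link=http://example.com/1.mp3` and `description=First`, one per line. Note that the channel title is included too, since nothing here restricts matching to what follows `<item>`.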

Seth Robertson
1
sed -rn -e 's|>[[:space:]]*<|>\n<|g   # Insert newlines before each element
/^[^<]/ D                             # If not starting with <, delete until 1st \n and restart
/^<[^t]/ D                            # If not starting with <t, ""
/^<t[^i]/ D                           # If not starting with <ti, ""
/^<ti[^t]/ D
/^<tit[^l]/ D
/^<titl[^e]/ D
/^<title[^>]/ D                       # If not starting with <title>, delete until 1st \n and restart
s|^<title>||                          # Delete <title>
s|</title>[^\n]*||                    # Delete </title> and everything after it until the newline
P                                     # Print everything up to the first newline
D' rss.xml                            # Delete everything up to the first newline and restart

By "restart" I mean go back to the top of the sed script and pretend we just read whatever is left.
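
The restart behaviour of `D` is easier to see on a tiny input. The following is only a sketch: the sample string and the `extract_titles` name are made up for illustration, GNU sed is assumed, and the letter-by-letter tests from the script above are collapsed into a single `/^<title>/!D` for brevity:

```shell
# Hypothetical one-line input; only the <title> elements are wanted.
feed='<rss><title>A</title><x>1</x><title>B</title></rss>'

# Assumes GNU sed (-r, and \n in the replacement).
# 1. The first s command puts every element on its own line of the pattern space.
# 2. If the first line is not a <title>, D drops it and the script restarts
#    on what is left, without reading new input.
# 3. Otherwise the tags are stripped, P prints the first line, and D drops it.
extract_titles() {
  printf '%s\n' "$1" | sed -rn 's|>[[:space:]]*<|>\n<|g
/^<title>/!D
s|^<title>([^<]*)</title>|\1|
P
D'
}

extract_titles "$feed"
```

For this sample the output should be `A` and `B` on separate lines: `<rss>`, `<x>1</x>` and `</rss>` are each discarded by a `D` restart, and each title is printed by `P` before its own `D`.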

I learned a lot about sed writing this. However, there is no question that you really should be doing this in Perl (or awk if you are old school).

In perl, this would be:

perl -pe 's%.*?<title>(.*?)</title>(?:.*?(?=<title>)|.*)%$1\n%g' rss.xml

This basically takes advantage of minimal matching (.*? is non-greedy: it matches the fewest characters possible). The positive lookahead at the end is just there so that I could do it in one s expression while still deleting everything after the last title. There is more than one way…

If you needed multiple tags out of this XML file, it is probably still possible, but it would likely involve branching and the like.

Seth Robertson
  • OMG, can you explain this? Moreover, I have not only titles to extract but also links, descriptions... I'll stick with my workaround if this is the only solution. As I read in other questions on SO, sed stands for _stream editor_, not _multiline editor_; I think it's appropriate for my task, and it would definitely be **the tool** if my XML had newlines. Is this simple difference worth using a totally different approach? Thanks for your effort anyway. – neurino May 21 '11 at 22:36
  • @neurino: I updated my post to include an explanation and one example perl solution. Yes, I would absolutely do this in perl instead of horribly complex sed scripts or sed pipelines. Oh, and I seriously think I deserve an upvote and accept even if you end up not using this due to complexity and since I was able to solve the problem as written. – Seth Robertson May 21 '11 at 23:14
  • upvote for sure, I'll wait a little to accept your answer to see if some other solution comes up. I updated my question – neurino May 21 '11 at 23:22
  • just a note: if I go with another tool, I'll save myself the trouble and parse the XML with Python... ;) – neurino May 21 '11 at 23:25
0

What about this:

sed -nr 's|>[[:space:]]*<|>\n<|g
    h
    /^<(title|link|description)>/ {
        s:<([^>]+)>([^<]+)</\1>:\2:
        P
    }
    g
    D
    ' feed
neurino