sed split single line file and process resulting lines

Question

I have an XML feed (this) on a single line, so to extract the data I need I do something like this:

sed -r 's:<([^>]+)>([^<]+)</\1>:&\n: g' feed | sed -nr '
    /<item>/, $ s:.*<(title|link|description)>([^<]+)</\1>.*:\2: p'

because I can't find a way to make the first sed call process its own output as separate lines.

Any advice?

My goal is to get all the data I need in a single sed call.

Well, you can chain expressions. `sed -e 's/foo//' -e 's/bar//'` will first remove foo and then bar. – Mel, May 21 '11 at 21:39
Even if I add `\n`, sed will keep on processing everything as a single line. – neurino, May 21 '11 at 21:40
Obligatory [Don't parse XML with RegEx Link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – johnsyweb, May 22 '11 at 00:37
@Johnsyweb: you are right, but I just need to download some podcasts and don't need a 100% error-free solution, so regex fits perfectly, and `sed` is available almost everywhere while _more recommended tools_ are not. My bash script, using my answer below, is 35 lines long (comments included) and is now downloading the mp3s I wanted. – neurino, May 22 '11 at 01:11

score 2 · Accepted Answer · answered May 21 '11 at 23:35

sed -rn -e 's|>[[:space:]]*<|>\n<|g
/^<title>/ { bx }
/^<description>/ { b x }
/^<link>/ { bx }
D
:x
s|<([^>]*)>([^\n]*)</\1>|\1=\2|;
P
D' rss.xml

New answer to the new question. Now with branches, and outputting all three chunks of information.
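For anyone wondering what the branches do: `bx` jumps to the `:x` label, which skips the `D` that would otherwise throw the current line away. A cut-down sketch, assuming GNU sed, a made-up input string, and element text that contains no `<` (a grouped address is used here for brevity, so this is not the script above verbatim):

```shell
# Hypothetical single-line input standing in for rss.xml.
input='<item><title>T</title><guid>9</guid><link>L</link></item>'

out=$(printf '%s\n' "$input" | sed -rn 's|>[[:space:]]*<|>\n<|g
/^<(title|link)>/ bx
D
:x
s|^<([^>]*)>([^<]*)</\1>|\1=\2|
P
D')

printf '%s\n' "$out"
# title=T
# link=L
```

Lines that do not start with a wanted tag never reach `:x`; they hit the first `D`, which drops them and reruns the script on the remainder.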

answered May 21 '11 at 23:35

Seth Robertson


We came up with a similar solution using the hold buffer. What is the `:x` label for? – neurino May 21 '11 at 23:57
Accepted, as long as you group the three `/^/ { bx }` lines into a single `/^<(title|description|link)>/ { bx }` :)) – neurino May 22 '11 at 18:44

Seth Robertson · Answer 2 · 2011-05-21T23:12:06.770

sed -rn -e 's|>[[:space:]]*<|>\n<|g   # Insert newlines before each element
/^[^<]/ D                             # If not starting with <, delete until 1st \n and restart
/^<[^t]/ D                            # If not starting with <t, ""
/^<t[^i]/ D                           # If not starting with <ti, ""
/^<ti[^t]/ D
/^<tit[^l]/ D
/^<titl[^e]/ D
/^<title[^>]/ D                       # If not starting with <title>, delete until 1st \n and restart
s|^<title>||                          # Delete <title>
s|</title>[^\n]*||                    # Delete </title> and everything after it until the newline
P                                     # Print everything up to the first newline
D' rss.xml                            # Delete everything up to the first newline and restart

By "restart" I mean go back to the top of the sed script and pretend we just read whatever is left.
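To make the "restart" concrete, here is a minimal sketch, assuming GNU sed (for `\n` in the replacement) and a made-up one-line input in place of rss.xml. `P` prints up to the first newline, and `D` deletes that much and then reruns the script on what is left without reading more input:

```shell
# Hypothetical single-line input standing in for rss.xml.
input='<a>1</a><b>2</b><c>3</c>'

out=$(printf '%s\n' "$input" | sed -n 's|><|>\n<|g
P
D')

printf '%s\n' "$out"
# <a>1</a>
# <b>2</b>
# <c>3</c>
```

On the second and third pass the `s` command finds nothing left to split, so only `P` and `D` do any work.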

I learned a lot about sed writing this. However, there is zero question that you really should be doing this in perl (or awk if you are old school).

In perl, this would be:

perl -pe 's%.*?<title>(.*?)</title>(?:.*?(?=<title>)|.*)%$1\n%g' rss.xml

This basically takes advantage of minimal matching (`.*?` is non-greedy: it matches the fewest characters possible). The positive lookahead at the end is just so that I could do it in one `s` expression while still deleting everything at the end. There is more than one way…

If you needed multiple tags out of this xml file, it probably is still possible, but would probably involve branching and the like.

OMG, can you explain this? Moreover, I have not only titles to extract but also links and descriptions... I'll stick with my workaround if this is the only solution. As I read in other questions on SO, sed stands for _stream editor_ and not _multiline editor_; I think it's appropriate for my task, and it would definitely be **the tool** if my XML had newlines. Is this simple difference worth using a totally different approach? Thanks for your effort anyway. – neurino, May 21 '11 at 22:36
@neurino: I updated my post to include an explanation and one example perl solution. Yes, I would absolutely do this in perl instead of horribly complex sed scripts or sed pipelines. Oh, and I seriously think I deserve an upvote and an accept even if you end up not using this due to complexity, since I was able to solve the problem as written. – Seth Robertson, May 21 '11 at 23:14
Upvote for sure. I'll wait a little before accepting your answer to see if some other solution comes up. I updated my question. – neurino, May 21 '11 at 23:22
Just a note: if I go with another tool, I'll save myself trouble and parse the XML with Python... ;) – neurino, May 21 '11 at 23:25

neurino · Answer 3 · 2011-05-22T18:45:28.417

0

What about this:

sed -nr 's|>[[:space:]]*<|>\n<|g
    h
    /^<(title|link|description)>/ {
        s:<([^>]+)>([^<]+)</\1>:\2:
        P
    }
    g
    D
    ' feed

edited May 22 '11 at 18:45

answered May 21 '11 at 23:48

neurino
