Getting text between first occurrence of two strings [Shell]

Question

I have a feed.xml file that looks something like this. I want to grab test.html from this feed (basically, the topmost item's URL). Any thoughts on how to do this?

<rss>
<item>
    <title>ABC</title>
    <url>
        test.html
    </url>
</item>
<item>
    <title>CDE</title>
    <url>
        test1.html
    </url>
</item>
</rss>

Thanks!

Why do you want to do this in shell only? It's better to use an XML parser. – anubhava, Apr 11 '12 at 20:19
It's part of a much bigger post-build script and, unfortunately, needs to be done in shell. – user754905, Apr 11 '12 at 20:21
Have your script invoke a program that uses an XML parser. Really. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – geekosaur, Apr 11 '12 at 20:27

Matthias · Accepted Answer · 2012-04-12T04:48:16.170

1

If the structure is fixed and you know that the URL has the suffix .html, you can simply do:

  cat <yourfile> | grep '\.html' | head -n1

If you don't know the suffix (or the string "html" can appear earlier in the file), you can do:

 cat <yourfile> | grep -A1 "<url>" | head -n2 | tail -n1

EDIT: In case the structure is not fixed (i.e., there are no newlines), this

 cat <yourfile> | grep -o "<url>[^<]*</url>" | head -n1 | cut -d'>' -f2 | cut -d'<' -f1

or that

 cat <yourfile> | grep -o "<url>[^<]*</url>" | head -n1 | sed -E -e"s#<url>(.*)</url>#\1#"

may work.
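
Note that `grep -o "<url>[^<]*</url>"` only matches when the opening and closing tags sit on the same line, so it finds nothing in the multi-line layout from the question. A minimal sketch that should cope with both layouts is to join the lines first. The sample file and the function name `first_url` are made up for illustration, and `grep -o` is assumed to be available (it is in GNU and BSD grep, but it is not POSIX).

```shell
#!/bin/sh
# Sample feed copied from the question, written to a temporary file.
feed=$(mktemp)
cat > "$feed" <<'EOF'
<rss>
<item>
    <title>ABC</title>
    <url>
        test.html
    </url>
</item>
<item>
    <title>CDE</title>
    <url>
        test1.html
    </url>
</item>
</rss>
EOF

# first_url FILE: print the text of the first <url> element.
# (Hypothetical helper name; not part of the original answer.)
first_url() {
    # 1. join all lines, 2. emit one <url>...</url> match per line,
    # 3. keep the first match, 4. strip the tags and surrounding whitespace
    tr -d '\n' < "$1" |
        grep -o '<url>[^<]*</url>' |
        head -n 1 |
        sed -e 's/<[^>]*>//g' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

url=$(first_url "$feed")
echo "$url"
rm -f "$feed"
```

This is still text matching, not XML parsing: it assumes the `<url>` tag has no attributes and that its content contains no `<` character.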

edited Apr 12 '12 at 04:48

answered Apr 11 '12 at 20:29

Matthias


The .html approach works fine if the file looks exactly like the post above, but if there are no newline characters, e.g. `<rss><item><title>ABC</title><url>test.html</url></item><item><title>CDE</title><url>test1.html</url></item></rss>`, it doesn't work. – user754905 Apr 11 '12 at 22:54
That is what I meant with 'fixed structure'. I edited my answer for this case. – Matthias Apr 12 '12 at 04:50
If you have GNU grep, `grep -m 1 -o '[-_.A-Za-z0-9]*\.html' file`, give or take a few characters in the character class depending on what type of file name you are trying to match. – tripleee Apr 12 '12 at 04:57
@tripleee I know that the `cat` command is not really needed here, but it lets you write everything as a pipe, so you can get your stream from somewhere else. The OP did not state where the input comes from. – Matthias Apr 12 '12 at 04:58

score 1 · Answer 2 · answered Apr 11 '12 at 21:11

1

This might work for you:

 sed '/<url>/,/<\/url>/{//d;s/ *//;q};d' file.xml

answered Apr 11 '12 at 21:11

potong


Throws error: sed: 1: "/<url>/,/<\/url>/{//d;s ...": extra characters at the end of q command – user754905 Apr 11 '12 at 22:57
I'm not sure why, but perhaps you are running an old version of sed. Try: `sed -e '/<url>/,/<\/url>/!d' -e '//d' -e 's/ *//' -e 'q' file` – potong Apr 11 '12 at 23:45
I've no idea why. What OS and sed version are you using? I noticed you used `"`s rather than `'`s to surround the commands; maybe this has an unwanted side effect. – potong Apr 12 '12 at 00:37
Mac OS X 10.7. I did use '. The error just came back like that. $ sed -e '/<url>/,/<\/url>/!d' -e '//d' -e 's/ *//' -e 'q' test.xml – user754905 Apr 12 '12 at 00:45
I believe Mac uses a BSD version of sed. I am on Linux, so I can't test any further. Good luck! – potong Apr 12 '12 at 01:58
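
The "extra characters at the end of q command" message in this thread is consistent with a sed that does not accept `}` directly after `q` on the same line, which GNU sed tolerates. A hedged sketch of the same idea, written with one command per line so that nothing follows `q`, is shown below. It sticks to POSIX sed syntax, but it has only been reasoned through against the sample feed from the question, and it still assumes the URL sits on its own line after `<url>`.

```shell
#!/bin/sh
# Sample feed copied from the question, written to a temporary file.
feed=$(mktemp)
cat > "$feed" <<'EOF'
<rss>
<item>
    <title>ABC</title>
    <url>
        test.html
    </url>
</item>
<item>
    <title>CDE</title>
    <url>
        test1.html
    </url>
</item>
</rss>
EOF

# -n suppresses automatic printing. Inside the <url>...</url> range:
# drop the tag lines, strip leading whitespace from the first remaining
# line, print it, and quit so later items are never read.
url=$(sed -n '/<url>/,/<\/url>/{
/<url>/d
/<\/url>/d
s/^[[:space:]]*//
p
q
}' "$feed")

echo "$url"
rm -f "$feed"
```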

anubhava · Answer 3 · 2012-04-11T20:36:09.230

0

This awk script should work:

awk '/<url>/ && url==0 {url=1;next;} {if(url==1) {print;url=2;}}' file

EDIT:

The following grep command might also work:

grep -m 1 "^ *<url>" -A1 file | grep -v "<url>"
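
Both commands above print the URL line with its original indentation. A small variation on the awk approach that also trims the whitespace and stops reading after the first match might look like this. It is a sketch that assumes, as the answer does, that the URL is on the line directly after `<url>`; the temporary sample file is only there to make the example self-contained.

```shell
#!/bin/sh
# Sample feed copied from the question, written to a temporary file.
feed=$(mktemp)
cat > "$feed" <<'EOF'
<rss>
<item>
    <title>ABC</title>
    <url>
        test.html
    </url>
</item>
<item>
    <title>CDE</title>
    <url>
        test1.html
    </url>
</item>
</rss>
EOF

# On the first line containing <url>: read the next line into $0,
# trim leading and trailing blanks, print it, and exit immediately.
url=$(awk '/<url>/ { getline; gsub(/^[ \t]+|[ \t]+$/, ""); print; exit }' "$feed")

echo "$url"
rm -f "$feed"
```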

edited Apr 11 '12 at 20:36

answered Apr 11 '12 at 20:27

anubhava


What if the tag looks something like – user754905 Apr 11 '12 at 23:01
In that case just replace `` with `` in both commands. – anubhava Apr 12 '12 at 04:52

je4d · Answer 4 · 2012-04-11T23:14:00.757

0

Instead of using line-based tools, I'd suggest using an XSL transform to get the data you want out of the document without making assumptions about the way it's formatted.

If you save this to get-url.xsl:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xsl:output method="text"/>
        <xsl:template match="/">
                    <xsl:value-of select="normalize-space(rss/item/url)"/>
        </xsl:template>
</xsl:stylesheet>

Then you can get the value of url from feed.xml like this:

$ xsltproc get-url.xsl feed.xml; echo
test.html
$

The extra echo is just there to give you a newline after the end of the output, to make it friendly for an interactive shell. Just remove it if you're assigning the result to a shell variable with $().
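
As a rough illustration of the `$()` case inside a larger script, the sketch below writes the stylesheet and the sample feed to a temporary directory and captures the result in a variable. The directory layout and the variable names are assumptions made for the example, the unused `xmlns:xs` declaration is left out, and the script only runs the transform if `xsltproc` is actually installed.

```shell
#!/bin/sh
# Work in a throwaway directory so the example does not touch real files.
workdir=$(mktemp -d)

# Stylesheet from the answer above (without the unused xmlns:xs declaration).
cat > "$workdir/get-url.xsl" <<'EOF'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <xsl:value-of select="normalize-space(rss/item/url)"/>
    </xsl:template>
</xsl:stylesheet>
EOF

# Sample feed copied from the question.
cat > "$workdir/feed.xml" <<'EOF'
<rss>
<item>
    <title>ABC</title>
    <url>
        test.html
    </url>
</item>
<item>
    <title>CDE</title>
    <url>
        test1.html
    </url>
</item>
</rss>
EOF

if command -v xsltproc >/dev/null 2>&1; then
    # No trailing "echo" needed: $() strips trailing newlines anyway.
    url=$(xsltproc "$workdir/get-url.xsl" "$workdir/feed.xml")
    echo "first url: $url"
else
    echo "xsltproc is not installed" >&2
fi

rm -rf "$workdir"
```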

edited Apr 11 '12 at 23:14

answered Apr 11 '12 at 20:50

je4d


How would you create the xsl if the url tag was replaced with `<bundle:releaselink>`? – user754905 Apr 11 '12 at 23:42
@user754905 if your input contains `<bundle:releaselink>`, then the root element of the input should contain the attribute `xmlns:bundle="..."`. You need to copy that `xmlns:bundle="..."` attribute into the root element of the xsl (i.e. the `xsl:stylesheet` element), and then replace `rss/item/url` with `rss/item/bundle:releaselink`. – je4d Apr 12 '12 at 23:35
