How to grep part of the content from a string in bash

Question

For example, when filtering an HTML file in which every line follows this pattern:

<a href="xxxxxx" style="xxxx"><i>some text</i></a>

How can I get the value of the href attribute, and how can I get the text between <i> and </i>?

Use xmlstarlet. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Ignacio Vazquez-Abrams, Dec 21 '10 at 05:15
@Ignacio Vazquez-Abrams: Does xmlstarlet work with HTML too? — Gumbo, Dec 21 '10 at 05:32
@Gumbo: You'd have to shove it through HTML Tidy first, but that's not too big a deal. And it's more a matter of the option not existing, not the underlying libraries being unable to handle it. — Ignacio Vazquez-Abrams, Dec 21 '10 at 05:33

score 1 · Accepted Answer · edited May 23 '17 at 09:58

cut -d'"' -f2 file

FYI: Just about every other HTML/regexp post on Stack Overflow explains why extracting values from HTML with anything other than an HTML parser is a bad idea. You may want to read some of those. This one for example.
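The cut one-liner only covers the href value. A minimal sketch of handling both parts of the question in the same line-oriented way follows. It assumes every line matches the pattern from the question exactly, with one <a> and one <i> element per line; the variable names line, href and text are illustrative only. It is subject to the same caveat about not parsing real-world HTML this way.

```shell
#!/usr/bin/env bash
# Sample line in the exact pattern from the question
# (assumption: one <a> and one <i> element per line).
line='<a href="xxxxxx" style="xxxx"><i>some text</i></a>'

# href value: the second field when the line is split on double quotes.
href=$(printf '%s\n' "$line" | cut -d'"' -f2)

# Text between <i> and </i>: capture it with a sed substitution.
# "|" is used as the sed delimiter so the "/" in </i> needs no escaping.
text=$(printf '%s\n' "$line" | sed -n 's|.*<i>\(.*\)</i>.*|\1|p')

printf '%s\n' "$href" "$text"
```

To run this over a whole file, the same cut and sed commands can be given the file name as an argument instead of reading from printf.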

answered Dec 21 '10 at 05:17

dietbuddha

score 0 · Answer 2 · answered Dec 21 '10 at 05:16

If href is always the second space-separated token in a line, then you can try:

grep "href" file | cut -d' ' -f2 | cut -d'=' -f2
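Note that, for the sample line from the question, this pipeline leaves the surrounding double quotes on the value. A small sketch of stripping them with tr is shown below; it assumes the href value itself contains no spaces or "=" characters (either would truncate the result), and the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Sample line in the pattern from the question.
line='<a href="xxxxxx" style="xxxx"><i>some text</i></a>'

# Same pipeline as above, fed from a variable instead of a file.
# At this point the value still carries its double quotes.
raw=$(printf '%s\n' "$line" | grep "href" | cut -d' ' -f2 | cut -d'=' -f2)

# Delete the double quotes to get the bare attribute value.
href=$(printf '%s\n' "$raw" | tr -d '"')

printf '%s\n' "$href"
```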

answered Dec 21 '10 at 05:16

Raghuram

score 0 · Answer 3 · answered Mar 12 '11 at 19:52

Here's how to do it using xmlstarlet (optionally with tidy):

# extract content of href and <i>...</i>
echo '<a href="xxxxxx" style="xxxx"><i>some text</i></a>' |
xmlstarlet sel -T -t -m "//a" -v @href -n -v i -n

# using tidy & xmlstarlet
echo '<a href="xxxxxx" style="xxxx"><i>some text</i></a>' |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null | 
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:a" -v @href -n -v . -n
