0

For example, when filtering an HTML file, if every line follows this pattern:

<a href="xxxxxx" style="xxxx"><i>some text</i></a>

how can I get the content of href, and how can I get the text between <i> and </i>?

sth
jojo

3 Answers

1

cut -d'"' -f2 file

FYI: Just about every other HTML/regexp post on Stack Overflow explains why extracting values from HTML with anything other than an HTML parser is a bad idea. You may want to read some of those. This one, for example.

dietbuddha
0

If href is always the second space-separated token in a line, then you can try:

grep "href" file | cut -d' ' -f2 | cut -d'=' -f2 | tr -d '"'
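
The question also asks for the text between <i> and </i>, which the cut pipeline does not return. A rough sed sketch for both values is shown here. It assumes every line matches the pattern from the question exactly, with one anchor per line and double-quoted attributes. The helper names extract_href and extract_i_text are made up for illustration. This is still text matching, not HTML parsing, so it will break on input that deviates from the pattern.

```shell
# Sketch only: assumes every input line looks exactly like
#   <a href="..." style="..."><i>some text</i></a>
# extract_href and extract_i_text are hypothetical helper names.

# Print the href value of each matching line read from stdin.
extract_href() {
  sed -n 's/.*<a [^>]*href="\([^"]*\)".*/\1/p'
}

# Print the text between <i> and </i> of each matching line read from stdin.
extract_i_text() {
  sed -n 's/.*<i>\([^<]*\)<\/i>.*/\1/p'
}

line='<a href="xxxxxx" style="xxxx"><i>some text</i></a>'
printf '%s\n' "$line" | extract_href     # prints: xxxxxx
printf '%s\n' "$line" | extract_i_text   # prints: some text
```

To run it over a file, redirect the file into either function, e.g. extract_href < file. Lines that do not match are skipped because of sed -n with the p flag.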

Raghuram
0

Here's how to do it using xmlstarlet (optionally with tidy):

# extract content of href and <i>...</i>
echo '<a href="xxxxxx" style="xxxx"><i>some text</i></a>' |
xmlstarlet sel -T -t -m "//a" -v @href -n -v i -n

# using tidy & xmlstarlet
echo '<a href="xxxxxx" style="xxxx"><i>some text</i></a>' |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null | 
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:a" -v @href -n -v . -n
tommy