Regex: Match numeric values after an element in html

Question

I have the following html:

<html>
  <head></head>
  <body>
     <span class="hello-style" id="hello123">
        hello world
     </span>
     <span class="value-style">
        1000
     </span>
     <span class="value-style">
        2000
     </span>
     <span class="value-style">
        3000
     </span>
  </body>
</html>

I would like to match each value that follows <span class="value-style">; the value itself can be anything. For the example above, the output should be:
1000
2000
3000

This should at least remove all non-numeric values, but it does not:
curl 127.0.0.1/index.html | sed 's/[a-zA-Z]/""/'

EDIT

curl 127.0.0.1/index.html | tr -d '\n' | sed '...'

Well, even after removing all line breaks so that it all appears as a single string (see my edit), it should be possible to match the values after span elements with class "value-style": `` — Rox, Aug 15 '17 at 19:08

score 1 · Accepted Answer · answered Aug 15 '17 at 19:14

awk to the rescue!

$ awk '/<\/span/{f=0} f; /<span class="value-style"/{f=1}' file

    1000
    2000
    3000

This extracts the lines between the two patterns: the opening value-style span and the closing </span>.
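How the flag in that one-liner works is easier to see with one rule per line. What follows is only a sketch of the same logic: the temporary file and the names html_file and values are made up for illustration, and the sample is a shortened copy of the question's HTML.

```shell
# Recreate part of the question's HTML in a temporary file
# (html_file is a hypothetical name used only for this sketch).
html_file=$(mktemp)
cat > "$html_file" <<'EOF'
<span class="hello-style" id="hello123">
   hello world
</span>
<span class="value-style">
   1000
</span>
<span class="value-style">
   2000
</span>
EOF

# Same three rules as the one-liner, evaluated in order for every line:
#   1. a line containing </span clears the flag f
#   2. a bare "f" is a pattern without an action: print the line if f is set
#   3. a line opening a value-style span sets f, so printing starts
#      on the NEXT line
# Because of this ordering, neither tag line is printed itself.
values=$(awk '
  /<\/span/                   { f = 0 }
  f
  /<span class="value-style"/ { f = 1 }
' "$html_file")

printf '%s\n' "$values"
rm -f "$html_file"
```

The printed lines keep their leading indentation, as in the output shown above. The "hello world" line is skipped because the flag is only ever set by a value-style opening tag.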

karakfa

Thanks! What exactly do `/<\/span/{f=0} f;` and the section after the semicolon do in the awk command? Does it first filter out all `span` rows? – Rox Aug 16 '17 at 06:56

score 1 · Answer 2 · answered Aug 15 '17 at 20:01

You shouldn't parse HTML/XML content with awk/sed tools.
The right way is to use an XML/HTML parser, such as xmlstarlet:

xmlstarlet sel -t -v '//span[@class="value-style"]' -n index.html | grep -o '[^[:space:]]*'

The output:

1000
2000
3000

//span[@class="value-style"] - XPath expression that selects the values of only those span tags whose class attribute is "value-style"
grep -o '[^[:space:]]*' - extracts the non-whitespace values from the output
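The whitespace-stripping stage can be tried on its own, without xmlstarlet installed. This is a rough sketch: the variable raw only imitates the shape of the span text (indentation and blank lines included) and is an assumption, not captured xmlstarlet output. The pattern is a slightly stricter variant that requires at least one non-whitespace character.

```shell
# Simulated text content of two value-style spans, with the surrounding
# newlines and indentation (assumed shape, not real xmlstarlet output).
raw=$(printf '\n        1000\n     \n\n        2000\n     \n')

# -o prints each matched run on its own line. Requiring at least one
# non-whitespace character avoids depending on how a particular grep
# treats empty matches.
clean=$(printf '%s\n' "$raw" | grep -o '[^[:space:]][^[:space:]]*')

printf '%s\n' "$clean"
```

One consequence of matching non-whitespace runs: a value that itself contains spaces would be split across several output lines, so this stage suits single-token values like the numbers in the question.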
