
I have the following HTML:

<html>
  <head></head>
  <body>
     <span class="hello-style" id="hello123">
        hello world
     </span>
     <span class="value-style">
        1000
     </span>
     <span class="value-style">
        2000
     </span>
     <span class="value-style">
        3000
     </span>
  </body>
</html>

I would like to extract the content of each <span class="value-style"> element, whatever that content is, so the output for the example above should be:
1000
2000
3000

The following command should at least remove all alphabetic characters, but it does not:
curl 127.0.0.1/index.html | sed 's/[a-zA-Z]/""/'

EDIT

curl 127.0.0.1/index.html | tr -d '\n' | sed '...'

Rox
  • https://stackoverflow.com/a/1732454/3772221 – m0meni Aug 15 '17 at 18:47
  • Well, even after removing all line breaks so that everything appears as a single string (see my edit), it should be possible to match the values after span elements with class "value-style": `` – Rox Aug 15 '17 at 19:08

2 Answers


awk to the rescue!

$ awk '/<\/span/{f=0} f; /<span class="value-style"/{f=1}' file

    1000
    2000
    3000

This prints the lines between the two patterns.
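A commented sketch of the same flag logic may make the rule order easier to follow. The temporary file, the `values` variable, and the extra `gsub` call that trims the indentation are assumptions added for illustration; they are not part of the one-liner above.

```shell
# Recreate the sample page from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
  <head></head>
  <body>
     <span class="hello-style" id="hello123">
        hello world
     </span>
     <span class="value-style">
        1000
     </span>
     <span class="value-style">
        2000
     </span>
     <span class="value-style">
        3000
     </span>
  </body>
</html>
EOF

# f is a flag: 1 while we are inside a value-style span, 0 otherwise.
# The three rules run in this order for every input line.
values=$(awk '
  # 1. A closing tag switches printing off before the line can be printed.
  /<\/span/ { f = 0 }
  # 2. While the flag is set, trim leading/trailing blanks and print the line.
  f { gsub(/^[ \t]+|[ \t]+$/, ""); print }
  # 3. An opening value-style tag switches printing on, starting with the NEXT line.
  /<span class="value-style"/ { f = 1 }
' "$sample")

printf '%s\n' "$values"
rm -f "$sample"
```

Because the rule that sets the flag comes last and the rule that clears it comes first, neither the opening nor the closing tag line is printed. Like the original, this relies on each tag and each value sitting on its own line.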

karakfa
  • Thanks! What exactly do `/<\/span/{f=0} f;` and the section after the semicolon do in the awk command? Does it first filter out all `span` rows? – Rox Aug 16 '17 at 06:56

You shouldn't parse HTML/XML content with tools like awk or sed.
The right way is to use an XML/HTML parser, such as xmlstarlet:

xmlstarlet sel -t -v '//span[@class="value-style"]' -n index.html | grep -o '[^[:space:]]*'

The output:

1000
2000
3000

  • //span[@class="value-style"] - XPath expression that selects the values of only those span elements whose class attribute equals "value-style"

  • grep -o '[^[:space:]]*' - prints only the runs of non-whitespace characters from that output, which drops the surrounding indentation and blank lines
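As a possible variation, the whitespace can be trimmed inside the XPath itself with `normalize-space()`, which removes the need for the `grep` step. This is only a sketch: the temporary file and the `values` variable are hypothetical names, the check for an installed `xmlstarlet` binary is an added assumption, and it only works because the sample page happens to be well-formed XML.

```shell
# Recreate the sample page from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
  <head></head>
  <body>
     <span class="hello-style" id="hello123">
        hello world
     </span>
     <span class="value-style">
        1000
     </span>
     <span class="value-style">
        2000
     </span>
     <span class="value-style">
        3000
     </span>
  </body>
</html>
EOF

if command -v xmlstarlet >/dev/null 2>&1; then
  # -m iterates over every matching span; -v prints the value of the
  # expression for the current match; -n emits a newline after each one.
  values=$(xmlstarlet sel -t \
    -m '//span[@class="value-style"]' \
    -v 'normalize-space(.)' -n "$sample")
  printf '%s\n' "$values"
else
  echo "xmlstarlet is not installed; skipping" >&2
fi

rm -f "$sample"
```

Unlike the `grep` filter, `normalize-space()` keeps a value that contains internal spaces on a single line, collapsing each run of whitespace to one space.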

RomanPerekhrest