I have some text inside a span tag in an HTML file.

I need to extract it. This is what I have tried so far, but it doesn't work:

HTML:

"<span id=\"MainContent_lblGenAssessment\">$866,250</span></dd>"

I tried this:

gsub(x = "<span id=\"MainContent_lblGenAssessment\">$866,250</span></dd>", pattern = ">(.*?)<", replacement = "\\1")

But it doesn't return what I want. How can I extract the 866,250?

Edit: it must use the default R libraries, I can't install any packages.

zx8754
Kevin
  • "I can't install any packages" That's extremely unlikely. – Roland Apr 28 '16 at 07:03
  • Obligatory link to canonical question on the subject: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – sleske Apr 28 '16 at 13:29
  • Regex should not be used on html. The proper way to do this would be to install an html parsing package and do it properly. I recommend `XML::xmlValue` – Rich Scriven Apr 28 '16 at 13:51

2 Answers

The right way to do this is to parse the HTML with a parser, like so:

library(rvest)
x %>% read_html() %>% html_text()
# [1] "$866,250"

If you must do it with regex (a very bad idea if it's for a lot of data or if the result is otherwise hard to inspect, e.g. in programmatic usage), you could do it with:

sub('.*>([^<]+)<.*', '\\1', x)
# [1] "$866,250"

If that span tag is in the middle of a lot more HTML, you'll have to add more regex to specify.

The regex looks for

  • any character . repeated 0 or more times *,
  • followed by >
  • followed by a capturing group ( ... )
    • containing any character except [^ ... ]
      • a <
    • repeated one or more times +
  • followed by <
  • followed by any character . repeated 0 or more times *,

and replaces it with the first captured group, \\1.
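
Since the span has a unique id, one way to "add more regex to specify" is to put that id into the pattern. This is only a sketch in base R, assuming the attribute is written exactly as in the question (same quoting, no other attributes on the tag); the variable names `pattern`, `value` and `amount` are just for illustration.

```r
# Example input from the question (the id is assumed to be unique in the page).
x <- "<span id=\"MainContent_lblGenAssessment\">$866,250</span></dd>"

# Capture only the text of the span with this id; [^<]+ stops at the next tag.
pattern <- '.*<span id="MainContent_lblGenAssessment">([^<]+)</span>.*'
value <- sub(pattern, "\\1", x)
value
# [1] "$866,250"

# Drop the dollar sign and thousands separators to get a number.
amount <- as.numeric(gsub("[$,]", "", value))
amount
# [1] 866250
```

Note that `sub()` returns its input unchanged when the pattern does not match, so it is worth checking the result before converting it to a number.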

alistaire
  • This worked, can you explain the regex please. This text is in a particular id which is unique. – Kevin Apr 28 '16 at 05:33
  • @Kevin Edited to explain. It matches the whole line, but only captures what's between `>` and `<`, and replaces the whole thing with what's captured. – alistaire Apr 28 '16 at 05:40

Try this:

([\d,]*)<\/span>

This assumes that every number you want to extract is inside a <span> tag.
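
The pattern above is written in generic regex syntax, so here is a rough sketch of how it might be applied with base R only. It makes a few substitutions that are my assumptions, not part of the answer: `[0-9]` stands in for `\d` so the pattern does not depend on the regex engine, `/` is left unescaped (R does not need `\/`), and `+` replaces `*` so an empty match is not possible.

```r
x <- "<span id=\"MainContent_lblGenAssessment\">$866,250</span></dd>"

# regexec() records the full match and each capture group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(x, regexec("([0-9,]+)</span>", x))

# Element 1 is the full match, element 2 is the first capture group.
digits <- m[[1]][2]
digits
# [1] "866,250"
```

Because the pattern starts at the digits, the leading `$` is not part of the result.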

JanLeeYu