Extract text out of '<' and '>'

Question

I have some text inside a span tag in an HTML file.

I need to extract it. This is what I have tried so far, but it doesn't seem to work:

HTML:

"<span id=\"MainContent_lblGenAssessment\">$866,250</span></dd>"

I tried this:

gsub(x = "<span id=\"MainContent_lblGenAssessment\">$866,250</span></dd>", pattern = ">(.*?)<", replacement = "\\1")

But this doesn't give me what I want. How can I extract the 866,250?

Edit: It must use only the default R libraries; I can't install any packages.

Obligatory link to canonical question on the subject: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — sleske, Apr 28 '16 at 13:29
Regex should not be used on html. The proper way to do this would be to install an html parsing package and do it properly. I recommend `XML::xmlValue` — Rich Scriven, Apr 28 '16 at 13:51

alistaire · Accepted Answer · 2016-04-28T06:48:43.863

5

The right way to do this is to parse the HTML with a parser, like so:

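library(rvest)
# x is the HTML string from the question
x <- "<span id=\"MainContent_lblGenAssessment\">$866,250</span></dd>"
x %>% read_html() %>% html_text()
# [1] "$866,250"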

If you must do it with regex (a very bad idea if it's for a lot of data or if the result is otherwise hard to inspect, e.g. in programmatic usage), you could do it with:

sub('.*>([^<]+)<.*', '\\1', x)
# [1] "$866,250"

If that span tag is in the middle of a lot more HTML, you'll have to make the regex more specific, for example by matching on the span's unique id.
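As a rough sketch of what "more specific" could look like in base R, the pattern below includes the span's id so that other tags are ignored. The surrounding `html` string (the extra `dl`, `dt`, and `OtherLabel` span) is made up for illustration, and the approach assumes the attribute is always written exactly this way.

```r
# Hypothetical larger document; only the MainContent_lblGenAssessment span
# comes from the question.
html <- paste0(
  "<dl><dt>Assessment</dt>",
  "<dd><span id=\"MainContent_lblGenAssessment\">$866,250</span></dd>",
  "<dd><span id=\"OtherLabel\">$1,000</span></dd></dl>"
)

# Match the opening tag with the known id, capture everything up to the next "<".
pattern <- "<span id=\"MainContent_lblGenAssessment\">([^<]*)</span>"

# regexec() returns the full match plus capture groups; regmatches() extracts them.
m <- regmatches(html, regexec(pattern, html))
value <- m[[1]][2]  # element 1 is the whole match, element 2 is the first group
value
# [1] "$866,250"
```

This still breaks if the attribute order, quoting, or whitespace in the tag changes, which is the usual argument for a parser when one is available.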

The regex looks for

any character . repeated 0 or more times *,
followed by >
followed by a capturing group ( ... )
- containing any character except [^ ... ]
  - a <
- repeated one or more times +
followed by <
followed by any character . repeated 0 or more times *,

and replaces it with the first captured group, \\1.
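To see the pieces described above, here is a small base-R sketch that shows the whole match and the captured group separately, using the string from the question. The final step of stripping `$` and `,` to get a number is an addition for illustration, not part of the original answer.

```r
x <- "<span id=\"MainContent_lblGenAssessment\">$866,250</span></dd>"

# parts[1] is the text matched by the whole pattern, parts[2] the first group.
parts <- regmatches(x, regexec(".*>([^<]+)<.*", x))[[1]]

identical(parts[1], x)  # the pattern matches the entire string
# [1] TRUE
parts[2]                # only the text between ">" and "<" is captured
# [1] "$866,250"

# Optional: drop the dollar sign and thousands separator, then convert.
amount <- as.numeric(gsub("[$,]", "", parts[2]))
amount
# [1] 866250
```

Because the whole string is matched, `sub()` replaces all of it with the captured group, which is why only `$866,250` is left.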

edited Apr 28 '16 at 06:48

answered Apr 28 '16 at 05:31

alistaire


This worked, can you explain the regex please. This text is in a particular id which is unique. – Kevin Apr 28 '16 at 05:33
@Kevin Edited to explain. It matches the whole line, but only captures what's between `>` and `<`, and replaces the whole thing with what's captured. – alistaire Apr 28 '16 at 05:40

score 2 · Answer 2 · answered Apr 28 '16 at 05:20

2

Try this:

([\d,]*)<\/span>

This assumes that every number you want to extract is directly followed by a closing </span> tag.
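A minimal sketch of how this pattern might be used from base R: backslashes have to be doubled inside an R string literal, and `perl = TRUE` is assumed here so that `\d` inside the bracket expression is interpreted by PCRE. Wrapping the pattern in `.*?` and `.*` so that `sub()` returns only the captured digits is an addition, not part of the pattern as posted.

```r
x <- "<span id=\"MainContent_lblGenAssessment\">$866,250</span></dd>"

# "\\d" in the R source becomes \d in the regex itself.
# The lazy ".*?" skips everything before the run of digits and commas;
# the trailing ".*" consumes the rest so the whole string is replaced.
digits <- sub(".*?([\\d,]*)<\\/span>.*", "\\1", x, perl = TRUE)
digits
# [1] "866,250"
```

Unlike the `[^<]+` version, this leaves out the `$` sign, since only digits and commas are captured.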

answered Apr 28 '16 at 05:20

JanLeeYu

981
2
9
24

Error: '\d' is an unrecognized escape in character string starting ""([\d" – Kevin Apr 28 '16 at 05:24
@Kevin - `\\d` in R regex. – thelatemail Apr 28 '16 at 05:25
@JanLeeYu here is the result: $866,250" It removed the span tag, I have no clue why R does this. – Kevin Apr 28 '16 at 05:26
