
Does anybody know which regex to use to extract the string stddata__2015_02_04 from the string "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>" in R? You may assume that the beginning, stddata__201, is known and that only the ending changes from time to time.

Marcin
    [Quote](http://stackoverflow.com/a/1732454/3521006): "HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. _Even Jon Skeet cannot parse HTML using regular expressions_. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp." [italics added] – talat Mar 05 '15 at 22:59

3 Answers


If the input is:

x <- "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>"

then use sub:

sub(".*(stddata__201[_0-9]+).*", "\\1", x)

giving:

[1] "stddata__2015_02_04"

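In the pattern .*(stddata__201[_0-9]+).* the leading and trailing .* consume everything around the target, and the parenthesized group captures stddata__201 followed by one or more digits or underscores. The \\1 in the replacement then keeps only that captured group.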
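One caveat worth sketching: when the pattern does not match, sub() returns the input unchanged, which is easy to mistake for a result. A minimal base R alternative, assuming you want an empty result on failure, is regmatches() with regexpr(). The variable names and the non-matching input y below are hypothetical, chosen only for illustration.

```r
x <- "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>"
pattern <- "stddata__201[_0-9]+"

# regexpr() locates the first match; regmatches() extracts it.
first_match <- regmatches(x, regexpr(pattern, x))

# Hypothetical input that does not contain the target.
y <- "<li><a href=\"other/\"> other/</a></li>"

# sub() hands back the input untouched when nothing matches ...
no_match_sub <- sub(".*(stddata__201[_0-9]+).*", "\\1", y)

# ... while regmatches() returns character(0).
no_match_rm <- regmatches(y, regexpr(pattern, y))
```

Checking length(no_match_rm) == 0 is then a simple way to detect a missing entry.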

G. Grothendieck

I tend to agree with the other posters: a regex is not the best way to do this. However, if you really want to do this with a regex, here it is.

(?<=>\s)([^<>\/])+        # Works in php and python, and most other languages
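Since the question is about R, here is a rough sketch of how a lookbehind like this could be used there. It assumes perl = TRUE (R's default regex engine does not support lookbehind), doubles the backslash for the R string literal, and drops the capture group because only the overall match is needed.

```r
x <- "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>"

# perl = TRUE switches to PCRE, which supports the (?<=...) lookbehind.
# The lookbehind requires ">" plus one whitespace character before the match;
# [^<>/]+ then takes everything up to the next "<", ">" or "/".
m <- regmatches(x, regexpr("(?<=>\\s)[^<>/]+", x, perl = TRUE))
```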
Blue0500
> library("stringr")
> str_extract("<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>",
+             "stddata__201[0-9]_[0-9]{2}_[0-9]{2}")
[1] "stddata__2015_02_04"

The preferred solution is not to use a regex at all:

> library("rvest")
> "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>" %>% 
+   read_html() %>% 
+   html_text()
[1] " stddata__2015_02_04/"
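The extracted text still carries a leading space and a trailing slash. A small base R sketch for tidying it, assuming the text always has that shape (the variable names are made up for illustration):

```r
txt <- " stddata__2015_02_04/"   # the html_text() result shown above

# trimws() strips surrounding whitespace; sub() drops one trailing slash.
name <- sub("/$", "", trimws(txt))
```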
cory