Does anybody know which regex to use to extract the string stddata__2015_02_04
from the string "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>"
in R? You may assume that the beginning, stddata__201,
is known, and only the ending changes from time to time.

Marcin
[Quote](http://stackoverflow.com/a/1732454/3521006): "HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. _Even Jon Skeet cannot parse HTML using regular expressions_. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp." [italics added] – talat Mar 05 '15 at 22:59
3 Answers
If the input is:
x <- "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>"
then use sub:
sub(".*(stddata__201[_0-9]+).*", "\\1", x)
giving:
[1] "stddata__2015_02_04"
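Here is the regular expression on its own:
.*(stddata__201[_0-9]+).*
The parenthesized group captures the known prefix followed by any run of digits and underscores. The .* on either side consumes the rest of the string, so replacing the whole match with \\1 leaves only the captured name.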
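One caveat with sub() is that it returns its input unchanged when the pattern does not match, so a line without a release name would come back as raw HTML. A minimal sketch of an alternative, assuming base R only; the vector `lines` and the helper name `extract_release` are made up for illustration:

```r
# Hypothetical input: two listing lines, one without a release name.
lines <- c(
  "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>",
  "<li><a href=\"README\"> README</a></li>"
)

# Return the first match per element, or NA when nothing matches.
extract_release <- function(x, pattern = "stddata__201[_0-9]+") {
  m <- regexpr(pattern, x)                 # -1 marks elements with no match
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)         # regmatches() drops non-matches
  out
}

releases <- extract_release(lines)
print(releases)
```

Matching only the name, instead of replacing everything around it, also avoids the leading and trailing .* in the pattern.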

G. Grothendieck
I tend to agree with the other posters: regex is not the best way to do this. However, if you REALLY want to do this with regex, here it goes.
(?<=>\s)([^<>\/])+ # Works in PHP and Python, and most other languages
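In R itself, the lookbehind needs perl = TRUE, because the default regex engine in base R does not support lookaround. A sketch of how the pattern above might be adapted, assuming the input is stored in x as in the first answer:

```r
x <- "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>"

# (?<=>\s) requires the match to be preceded by ">" and one whitespace character;
# [^<>/]+ then takes everything up to the next "<", ">" or "/".
m <- regexpr("(?<=>\\s)[^<>/]+", x, perl = TRUE)
link_text <- regmatches(x, m)
print(link_text)
```

The capture group from the original pattern is dropped here because regmatches() returns the whole match, not a group.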

Blue0500
> library("stringr")
> str_extract("<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>",
+ "stddata__201[0-9]_[0-9]{2}_[0-9]{2}")
[1] "stddata__2015_02_04"
The preferred solution is not to use regex at all:
> library("rvest")
> "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>" %>%
+ read_html() %>%  # called html() in early rvest releases
+ html_text()
[1] " stddata__2015_02_04/"
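Note that html_text() returns the link text as written, including the leading space and the trailing slash. A small base R sketch of the cleanup, assuming the text has already been extracted into a variable (called txt here for illustration):

```r
txt <- " stddata__2015_02_04/"

# trimws() removes surrounding whitespace; sub() drops one trailing slash.
clean <- sub("/$", "", trimws(txt))
print(clean)
```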

cory
Thanks. The same operation works with stri_extract() from the stringi package, and it is faster :) – Marcin Mar 05 '15 at 23:03