Which regex to use in R?

Question

Does anybody know which regex to use in R to extract the string stddata__2015_02_04 from the string "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>"? You may assume that the beginning, stddata__201, is known, and that only the ending changes from time to time.

[Quote](http://stackoverflow.com/a/1732454/3521006): "HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. _Even Jon Skeet cannot parse HTML using regular expressions_. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp." [italics added] — talat, Mar 05 '15 at 22:59

score 3 · Accepted Answer · answered Mar 05 '15 at 23:09

If the input is:

x <- "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>"

then use sub:

sub(".*(stddata__201[_0-9]+).*", "\\1", x)

giving:

[1] "stddata__2015_02_04"

The regular expression used above (the original answer showed a Debuggex visualization of it):

.*(stddata__201[_0-9]+).*
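One thing to keep in mind with the sub() approach: when the pattern does not match, sub() returns the input unchanged rather than NA. A minimal sketch of a guarded variant in base R follows. The helper name extract_stddata is hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): return the
# stddata__201... token, or NA when the input does not contain one.
extract_stddata <- function(x) {
  pattern <- "stddata__201[_0-9]+"
  m <- regexpr(pattern, x)            # start of the first match, -1 if none
  found <- m != -1L
  out <- rep(NA_character_, length(x))
  out[found] <- regmatches(x, m)      # regmatches() keeps matched elements only
  out
}

x <- "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>"
extract_stddata(x)
extract_stddata("<li>no match here</li>")
```

The first call should give "stddata__2015_02_04" and the second NA, which makes a failed match easier to detect than comparing the result of sub() with its input.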

score 2 · Answer 2 · answered Mar 06 '15 at 00:25

I tend to agree with the other posters: regex is not the best way to do this. However, if you REALLY want to do this with regex, here it goes.

(?<=>\s)([^<>\/])+        # Works in PHP, Python, and most other languages; in R, lookbehind needs perl = TRUE
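Since the question is about R, here is a minimal sketch of how the pattern above could be applied in base R, assuming the same input string as in the question. Lookbehind is only supported by the PCRE engine, so perl = TRUE is required. The pattern is written slightly more simply here (no capture group, no escaped slash), which is a simplification and not the answerer's exact expression.

```r
x <- "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>"

# Backslashes are doubled inside an R string literal; "/" needs no escaping.
pattern <- "(?<=>\\s)[^<>/]+"

# perl = TRUE switches to PCRE, which supports the (?<=...) lookbehind.
m <- regexpr(pattern, x, perl = TRUE)
result <- regmatches(x, m)
result
```

The match starts right after the "> " that follows the href attribute and stops at the first "/", so result should hold "stddata__2015_02_04".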

answered Mar 06 '15 at 00:25 by Blue0500

score 1 · Answer 3 · answered Mar 05 '15 at 23:00

> library("stringr")
> str_extract("<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>",
+             "stddata__201[0-9]_[0-9]{2}_[0-9]{2}")
[1] "stddata__2015_02_04"

The preferred solution, though, is not to use a regex at all and to parse the HTML instead:

> library("rvest")
> "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>" %>%
+   read_html() %>%   # called html() in older rvest releases
+   html_text()
[1] " stddata__2015_02_04/"
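The link text above still carries a leading space and a trailing slash. One option is to read the href attribute instead and strip the slash. This is a sketch, assuming an rvest version that provides read_html(), html_nodes() and html_attr(). The cleanup with sub() is an addition, not part of the original answer.

```r
library(rvest)

x <- "<li><a href=\"stddata__2015_02_04/\"> stddata__2015_02_04/</a></li>"

links <- read_html(x) %>%
  html_nodes("a") %>%      # select every <a> element
  html_attr("href")        # read its href attribute

dir_name <- sub("/$", "", links)   # drop the trailing slash
dir_name
```

With the input from the question, dir_name should be "stddata__2015_02_04". The same code works unchanged on a page with several such links, returning one entry per link.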

answered Mar 05 '15 at 23:00 by cory

Thanks. The same operation works with stri_extract() from the stringi package, and it is faster :) – Marcin Mar 05 '15 at 23:03
Oooh! The rvest package looks better :) Great solution! – Marcin Mar 06 '15 at 11:18
