How can I extract URLs containing dashes from HTML using R?

Question

I have some HTML that looks like this:

<ul><li><a href="http://www.website.com/index.aspx" target="_blank">Website</a></li>
<li><a href="http://website.com/index.html" target="_blank">Website</a></li>
<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>
<li><a href="http://website2.org/index.htm" target="_blank">Website 2</a></li>
<li><a href="http://www.another-site.com/">Another Site</a></li>

Using

m<-regexpr("http://\\S*/?", links, perl=T)
links<-regmatches(links, m)

gets me the links, except the ones with dashes in them are truncated like this:

http://www.website.com/index.aspx
http://website.com/index.html
http://www.website
http://website2.org/index.htm
http://www.another-site.com/

I thought \S matched any non-whitespace character. What's going on?

I can't replicate your issue. If I replace the `"` with `\"` so I can import the text with `readLines`, everything works as you intended. — thelatemail, Aug 22 '13 at 06:05
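
For anyone trying to reproduce this, here is a minimal base-R sketch that checks whether `\S` stops at a hyphen. The `html` vector is a hypothetical sample built from a few of the lines in the question, not the asker's actual input. The character class `[^"\s]` is a swapped-in pattern (not the one from the question) so that the match ends at the closing quote of the `href` attribute.

```r
# Hypothetical sample input: one HTML line per element
html <- c(
  '<ul><li><a href="http://www.website.com/index.aspx" target="_blank">Website</a></li>',
  '<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>',
  '<li><a href="http://www.another-site.com/">Another Site</a></li>'
)

# \S matches any non-whitespace character, and a hyphen is one of them
dash_ok <- grepl("^\\S+$", "website-with-dashes", perl = TRUE)

# Match from "http://" up to (not including) the next quote or whitespace
m <- regexpr('http://[^"\\s]+', html, perl = TRUE)
urls <- regmatches(html, m)

print(dash_ok)
print(urls)
```

Note that `regexpr` only returns the first match in each element; if a single line can hold several links, `gregexpr` would be the function to reach for.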

score 4 · Accepted Answer · edited May 23 '17 at 12:28


Use XML::getHTMLLinks

For example:

library(XML)
# assuming your HTML document is saved as 'foo.html'

getHTMLLinks(doc = 'foo.html')
# [1] "http://www.website.com/index.aspx"  "http://website.com/index.html"      "http://www.website-with-dashes.org"
# [4] "http://website2.org/index.htm"      "http://www.another-site.com/"

Parsing HTML with regex is not necessarily straightforward. https://stackoverflow.com/a/1732454/1385941 is an interesting read.
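
If the HTML is already held in a character string rather than a file, a sketch along the same lines might look like the following. It assumes the XML package's `htmlParse()` (with `asText = TRUE`) and `xpathSApply()`; the `html` string is a hypothetical sample based on the question, and the XPath query `//a/@href` is my own choice for illustration.

```r
library(XML)

# Hypothetical sample input held in memory as a single string
html <- paste(
  '<ul><li><a href="http://www.website.com/index.aspx" target="_blank">Website</a></li>',
  '<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>',
  '<li><a href="http://www.another-site.com/">Another Site</a></li></ul>',
  sep = "\n"
)

# Parse the string as HTML instead of treating it as a file name
doc <- htmlParse(html, asText = TRUE)

# Pull the href attribute of every <a> element; drop the "href" names
links <- unname(xpathSApply(doc, "//a/@href"))

print(links)
```

Because the parser works on the document tree instead of the raw text, hyphens, quotes, and attribute order in the markup do not affect what is extracted.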

answered Aug 22 '13 at 06:08

mnel

Yes, I've read that, but just thought my application was simple enough that I'd give it a go. This answer doesn't solve my exact issue, but it pointed me in a direction to a different and possibly better way of solving the problem. – William Gunn Aug 22 '13 at 06:21
