I have some HTML that looks like this:

<ul><li><a href="http://www.website.com/index.aspx" target="_blank">Website</a></li>
<li><a href="http://website.com/index.html" target="_blank">Website</a></li>
<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>
<li><a href="http://website2.org/index.htm" target="_blank">Website 2</a></li>
<li><a href="http://www.another-site.com/">Another Site</a></li>

Using

m <- regexpr("http://\\S*/?", links, perl = TRUE)
links <- regmatches(links, m)

gets me the links, except the ones with dashes in them are truncated like this:

http://www.website.com/index.aspx
http://website.com/index.html
http://www.website
http://website2.org/index.htm
http://www.another-site.com/

I thought \S matched any non-whitespace character. What's going on?

William Gunn
  • I can't replicate your issue. If I replace the `"` with `\"` so I can import the text with `readLines`, everything works as you intended. – thelatemail Aug 22 '13 at 06:05

1 Answer

Use XML::getHTMLLinks.

For example:

library(XML)
# assuming your HTML document is saved as 'foo.html'

getHTMLLinks(doc = 'foo.html')
# [1] "http://www.website.com/index.aspx"  "http://website.com/index.html"      "http://www.website-with-dashes.org"
# [4] "http://website2.org/index.htm"      "http://www.another-site.com/" 

Parsing HTML with regex is not necessarily straightforward. https://stackoverflow.com/a/1732454/1385941 is an interesting read.
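If you would rather stay with a regex for input this simple, here is a minimal base R sketch. It assumes the HTML lines sit in a character vector, called `links` here as in the question, with three of the sample lines retyped by hand. Note that `\S` does match `-`, so the truncation probably comes from something other than the pattern itself. The pattern below is my own variation, not the one from the question: it stops at the closing quote of the `href` value instead of relying on `\S` alone.

```r
# Three sample lines retyped from the question (an assumption about how `links` is stored)
links <- c(
  '<ul><li><a href="http://www.website.com/index.aspx" target="_blank">Website</a></li>',
  '<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>',
  '<li><a href="http://www.another-site.com/">Another Site</a></li>'
)

# Match http:// followed by any run of characters that are neither
# a double quote nor whitespace, so the match ends at the end of the href value
m <- regexpr('http://[^"\\s]+', links, perl = TRUE)

# regexpr() finds the first match in each element; regmatches() extracts them
urls <- regmatches(links, m)
print(urls)
```

This only picks up the first http:// URL on each line and ignores https links, so for anything beyond a quick one-off the parser-based approach is still the safer choice.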

mnel
  • Yes, I've read that, but just thought my application was simple enough that I'd give it a go. This answer doesn't solve my exact issue, but it pointed me in a direction to a different and possibly better way of solving the problem. – William Gunn Aug 22 '13 at 06:21