I would like to fetch a URL that contains a hash (`#`) fragment using the `GET` function from the R httr package:

httr::GET("https://en.wikipedia.org/wiki/Kona_Lanes#Peak_years")

However, only the part of the URL before the hash is returned.

Here is another example. The results for the "first" and "second" page are the same:

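 library(httr)
 library(XML)  # htmlTreeParse() comes from the XML package
 url1 <- "example.com"
 url2 <- "example.com#foo=bar"
 # Parse the response body as text rather than passing the response object itself
 res1 <- htmlTreeParse(content(GET(paste0("https://www.", url1)), as = "text"), asText = TRUE, useInternalNodes = TRUE)
 res2 <- htmlTreeParse(content(GET(paste0("https://www.", url2)), as = "text"), asText = TRUE, useInternalNodes = TRUE)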
peter
  • use %23 where # is present – Mukeshkumar S Apr 20 '17 at 13:20
  • Have a look at the `URLencode` function for handling special characters in URLs – Andrew Gustar Apr 20 '17 at 13:24
  • The hash part of the URL is merely a client-side construct and is not even sent to the server by normal browsers. So I don't see what exactly you expect to go differently here. – CBroe Apr 20 '17 at 13:53
  • @peter I think the `reserved=TRUE` is causing problems if you also encode the `https://` part. In any case, all `URLencode` does in this case is to replace `#` with `%23`, so it probably won't improve things for you. I think your problem might be that the `#Peak_years` is simply a bookmark on the page, which directs a browser to jump down to that section. As far as `GET` is concerned, it still needs to load the whole page, so probably just ignores the bookmark. Your best bet is probably to load the whole page and then write some code to extract the #Peak_years bit. – Andrew Gustar Apr 20 '17 at 18:15
  • Scraping Amazon is against their TOS so I can't help any further. – hrbrmstr Apr 20 '17 at 20:26
  • https://curl.haxx.se/mail/lib-2011-11/0178.html && http://stackoverflow.com/a/24726986/1457051 – hrbrmstr Apr 20 '17 at 20:36
  • Thanks - I changed the example. – peter Apr 20 '17 at 21:10
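
The suggestion in the comments, to load the whole page and then extract the `#Peak_years` part yourself, could look roughly like the sketch below. It assumes the xml2 package and that the fragment corresponds to an element `id` in the page. The inline HTML is a hypothetical stand-in for the downloaded page, not the real Wikipedia markup; in practice the string would come from `content(GET(url), as = "text")`.

```r
library(xml2)

# Hypothetical stand-in for the downloaded page body
html <- '<html><body>
  <h2><span id="History">History</span></h2><p>First section text.</p>
  <h2><span id="Peak_years">Peak years</span></h2><p>Second section text.</p>
</body></html>'

# The fragment is handled on the client side: look it up in the parsed page
fragment <- "Peak_years"
doc <- read_html(html)

# Element whose id matches the fragment
anchor <- xml_find_first(doc, sprintf("//*[@id='%s']", fragment))
xml_text(anchor)

# First paragraph that follows the anchor in document order
para <- xml_find_first(anchor, "following::p[1]")
xml_text(para)
```

Which nodes count as "the section" depends on the page's markup, so the second XPath expression would likely need adjusting for a real page.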

2 Answers

Use `%23` wherever `#` appears in the URL:

httr::GET("https://en.wikipedia.org/wiki/Kona_Lanes%23Peak_years")

Mukeshkumar S
  • I added another example: adding #2 should give the "next page". Replacing #2 by %232 gives in essence the same result. – peter Apr 20 '17 at 17:42
  • I can't help without the exact URL you are trying. You asked how to replace `#`, and that is what I answered. – Mukeshkumar S Apr 20 '17 at 21:26

What isn't working and/or what did you expect to be different with the fragment identifier and without it?

library(httr)
library(purrr)

res1 <- httr::GET("https://en.wikipedia.org/wiki/Kona_Lanes#Peak_years")
res2 <- httr::GET("https://en.wikipedia.org/wiki/Kona_Lanes%23Peak_years")
res3 <- httr::GET("https://en.wikipedia.org/wiki/Kona_Lanes")

txt1 <- content(res1, as="text")
txt2 <- content(res2, as="text")
txt3 <- content(res3, as="text")

identical(txt1, txt2)
## [1] TRUE

identical(txt2, txt3)
## [1] TRUE

identical(txt1, txt3)
## [1] TRUE
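
As a rough illustration of why the fragment makes no difference, the sketch below uses `httr::parse_url()` to split the URLs from the example into their components. It only shows how httr parses the string and does not by itself demonstrate what goes over the wire; the variable names `parts` and `encoded` are made up for this example.

```r
library(httr)

# With a literal "#", the fragment is parsed as its own component,
# separate from the path that identifies the resource
parts <- parse_url("https://en.wikipedia.org/wiki/Kona_Lanes#Peak_years")
parts$fragment

# With "%23" there is no fragment at all: the encoded hash stays in the path
encoded <- parse_url("https://en.wikipedia.org/wiki/Kona_Lanes%23Peak_years")
is.null(encoded$fragment)
encoded$path
```

If a particular section is needed, the fragment from `parts$fragment` can be kept and used to locate the matching element in the downloaded page.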
hrbrmstr