GET URL with Hashtag in R

Question

I would like to fetch a URL that contains a hashtag using the GET function from the httr package in R:

httr::GET("https://en.wikipedia.org/wiki/Kona_Lanes#Peak_years")

However, only the URL before the hashtag is returned.

Another example is the following. The results for the "first" and "second" page are the same:

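 library(httr)
 library(XML)  # provides htmlTreeParse()
 url1 <- "example.com"
 url2 <- "example.com#foo=bar"
 # htmlTreeParse() needs the response body as text, not the response object
 res1 <- htmlTreeParse(content(GET(paste0("https://www.", url1)), as = "text"), asText = TRUE, useInternalNodes = TRUE)
 res2 <- htmlTreeParse(content(GET(paste0("https://www.", url2)), as = "text"), asText = TRUE, useInternalNodes = TRUE)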

Have a look at the `URLencode` function for handling special characters in URLs. – Andrew Gustar, Apr 20 '17 at 13:24
The hash part of the URL is merely a client-side construct and is not even sent to the server by normal browsers. So I don't see what exactly you expect to go differently here. – CBroe, Apr 20 '17 at 13:53
@peter I think the `reserved=TRUE` is causing problems if you also encode the `https://` part. In any case, all `URLencode` does here is replace `#` with `%23`, so it probably won't improve things for you. I think your problem might be that `#Peak_years` is simply a bookmark on the page, which directs a browser to jump down to that section. As far as `GET` is concerned, it still needs to load the whole page, so it probably just ignores the bookmark. Your best bet is probably to load the whole page and then write some code to extract the `#Peak_years` bit. – Andrew Gustar, Apr 20 '17 at 18:15
Scraping Amazon is against their TOS so I can't help any further. – hrbrmstr, Apr 20 '17 at 20:26
https://curl.haxx.se/mail/lib-2011-11/0178.html && http://stackoverflow.com/a/24726986/1457051 – hrbrmstr, Apr 20 '17 at 20:36
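
The suggestion in the comments above (load the whole page, then pull out the part the fragment points to) could be sketched roughly as follows. This is only an illustration under assumptions: `split_fragment()` is a hypothetical helper and not part of httr, the xml2 package is assumed to be installed, and the HTML string is an invented stand-in for the text that `content(GET(...), as = "text")` would return, so the real page's markup will differ.

```r
library(xml2)

# Hypothetical helper (not part of httr): split a URL into the part that
# is requested from the server and the client-side fragment after '#'.
split_fragment <- function(url) {
  pos <- as.integer(regexpr("#", url, fixed = TRUE))
  if (pos == -1L) {
    return(list(request = url, fragment = NA_character_))
  }
  list(
    request  = substr(url, 1, pos - 1),
    fragment = substr(url, pos + 1, nchar(url))
  )
}

parts <- split_fragment("https://en.wikipedia.org/wiki/Kona_Lanes#Peak_years")

# Stand-in for content(httr::GET(parts$request), as = "text").
# The markup below is invented for illustration only.
html <- '<html><body>
  <h2 id="History">History</h2><p>Early text.</p>
  <h2 id="Peak_years">Peak years</h2><p>Peak text.</p>
</body></html>'

doc <- read_html(html)

# Locate the element whose id matches the fragment, then take the first
# paragraph that follows it at the same level.
target <- xml_find_first(doc, sprintf("//*[@id='%s']", parts$fragment))
para   <- xml_find_first(target, "following-sibling::p[1]")

xml_text(target)
xml_text(para)
```

On a real page the XPath after the `id` lookup would need adjusting to the site's actual structure; the point is only that the fragment is applied to the downloaded document rather than sent with the request.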

score 0 · Answer 1 · answered Apr 20 '17 at 13:22

Use %23 wherever # appears in the URL:

httr::GET("https://en.wikipedia.org/wiki/Kona_Lanes%23Peak_years")

Mukeshkumar S

I added another example: adding #2 should give the "next page". Replacing #2 by %232 gives in essence the same result. – peter Apr 20 '17 at 17:42
I can't help without the exact URL you are trying. You asked how to replace #, and that is what I answered. – Mukeshkumar S Apr 20 '17 at 21:26

score 0 · Answer 2 · answered Apr 20 '17 at 13:50

What isn't working and/or what did you expect to be different with the fragment identifier and without it?

library(httr)

res1 <- httr::GET("https://en.wikipedia.org/wiki/Kona_Lanes#Peak_years")
res2 <- httr::GET("https://en.wikipedia.org/wiki/Kona_Lanes%23Peak_years")
res3 <- httr::GET("https://en.wikipedia.org/wiki/Kona_Lanes")

txt1 <- content(res1, as="text")
txt2 <- content(res2, as="text")
txt3 <- content(res3, as="text")

identical(txt1, txt2)
## [1] TRUE

identical(txt2, txt3)
## [1] TRUE

identical(txt1, txt3)
## [1] TRUE

hrbrmstr

I added another example: adding #2 should give the "next page". – peter Apr 20 '17 at 15:14
Replacing #2 by %232 gives essentially the result for the "first page", not the "second page". Is there another "trick", or maybe a configuration option for the GET command, to do this? – peter Apr 20 '17 at 17:50
Scraping Amazon is against their TOS so I can't help any further. – hrbrmstr Apr 20 '17 at 20:27
Thanks - I changed the example. – peter Apr 20 '17 at 21:10