rvest: read_html() can't read a URL containing '#'

Question

I am scraping a music streaming website where new songs are updated and indexed. Only the 1st page loads correctly with the read_html function. It doesn't work for the 2nd page onward: the function returns the 1st page again.

It turns out this results from the structure of the URLs.

The URL of the 1st page (displaying 50 songs) is:

https://www.melon.com/genre/song_list.htm?gnrCode=GN0300

The URL of the 2nd page (displaying the 51st-100th songs) just appends a string to the first one, starting with '#':

https://www.melon.com/genre/song_list.htm?gnrCode=GN0300#params%5BgnrCode%5D=GN0300&params%5BdtlGnrCode%5D=&params%5BorderBy%5D=NEW&params%5BsteadyYn%5D=N&po=pageObj&startIndex=51

read_html doesn't seem to use the part beginning with '#', so it behaves as if I had passed in the URL of the 1st page again.

The URL of the 3rd page differs only in "startIndex=101", since it begins from the 101st song. read_html returns the 1st page for it too.

I think this problem is rooted in the way R treats content containing "#", as that character is associated with commenting. Is there a way to make it recognize the correct URLs? A quick fix would be very much appreciated. Thanks.

Welcome to StackOverflow. Please read [how to make a reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so we can see your code and try to answer your question. — M--, Nov 20 '18 at 20:32
A "#" is a link to an anchor in a page. With "normal" HTML pages, whatever is after the "#" will only determine where the page scrolls to when you load it. No matter what you put after the "#", your browser will always load the same HTML page. Note this might be different if the page uses a lot of javascript; but if the page uses javascript to load content, you can't use just `read_html`. You need something to execute the javascript, like "RSelenium". `read_html` is probably doing exactly what it's supposed to. Try turning off javascript in your browser and see what happens. — MrFlick, Nov 20 '18 at 20:33
Their robots.txt — https://www.melon.com/robots.txt — explicitly restricts your actions. You likely don't but anyone with an ethical bone in their skeleton now knows they should refrain from assisting you. — hrbrmstr, Nov 20 '18 at 20:52
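
To illustrate the comment above about "#", here is a minimal base-R sketch using the 2nd-page URL from the question. It shows two things: inside a string literal R keeps the "#" and everything after it (the comment syntax does not apply there), and the part after "#" is a URL fragment, which HTTP clients do not send to the server. The variable names (`base_url`, `fragment`, `requested`, `frag`) are illustrative, and the idea that the site's JavaScript reads the fragment to load later pages is an assumption based on the comment, not something verified here.

```r
# URL pieces copied from the question
base_url <- "https://www.melon.com/genre/song_list.htm?gnrCode=GN0300"
fragment <- paste0(
  "params%5BgnrCode%5D=GN0300&params%5BdtlGnrCode%5D=",
  "&params%5BorderBy%5D=NEW&params%5BsteadyYn%5D=N",
  "&po=pageObj&startIndex=51"
)
page2_url <- paste0(base_url, "#", fragment)

# Inside a string, '#' is an ordinary character, so R keeps the full URL.
has_hash <- grepl("#", page2_url, fixed = TRUE)

# Split at the first '#'. The left part is what an HTTP request asks for;
# the right part (the fragment) stays on the client side.
requested <- sub("#.*$", "", page2_url)
frag <- sub("^[^#]*#?", "", page2_url)

# The request for "page 2" is therefore the same as the request for page 1.
same_request <- identical(requested, base_url)

# Decoding the fragment shows the parameters a script in the browser
# could read (assumed here to be how the page loads songs 51-100).
decoded <- utils::URLdecode(frag)

print(has_hash)
print(same_request)
print(decoded)
```

If `same_request` is `TRUE`, the server receives an identical request for both URLs, which matches the behaviour described in the question: the returned HTML is the 1st page regardless of `startIndex`.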
