
I am scraping a music streaming website where new songs are listed and indexed. read_html loads only the 1st page successfully. It doesn't work for the 2nd page onward; instead, the function returns the 1st page again.

It turns out this results from the structure of the URLs, as follows.

The URL of the 1st page (displaying 50 songs) is:

https://www.melon.com/genre/song_list.htm?gnrCode=GN0300

And the URL of the 2nd page (displaying the 51st-100th songs) just appends a string to the first one, starting with #:

https://www.melon.com/genre/song_list.htm?gnrCode=GN0300#params%5BgnrCode%5D=GN0300&params%5BdtlGnrCode%5D=&params%5BorderBy%5D=NEW&params%5BsteadyYn%5D=N&po=pageObj&startIndex=51

read_html doesn't seem to take the part beginning with '#', so it effectively operates as if I had passed the URL of the first page again.
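To illustrate the behaviour described above, here is a minimal base R sketch that splits each URL at the first "#". The part before it is what is requested from the server; the part after it (the fragment) is handled by the browser only. The helper name `split_fragment` is hypothetical and not part of any package.

```r
# Hypothetical helper: separate the part of a URL that is sent to the
# server from the client-side fragment (everything after the first "#").
split_fragment <- function(url) {
  parts <- strsplit(url, "#", fixed = TRUE)[[1]]
  list(
    request  = parts[1],
    fragment = if (length(parts) > 1) paste(parts[-1], collapse = "#") else ""
  )
}

page1 <- "https://www.melon.com/genre/song_list.htm?gnrCode=GN0300"
page2 <- paste0(
  page1,
  "#params%5BgnrCode%5D=GN0300&params%5BdtlGnrCode%5D=&params%5BorderBy%5D=NEW",
  "&params%5BsteadyYn%5D=N&po=pageObj&startIndex=51"
)

p2 <- split_fragment(page2)

# The request part of page 2 is exactly the page 1 URL.
identical(p2$request, page1)  # TRUE

# Decode the percent-escapes (%5B is "[", %5D is "]") to read the fragment.
utils::URLdecode(p2$fragment)
```

So both URLs lead to the same request, which would explain why the same page comes back each time.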

The 3rd page differs only in "startIndex=101", as it begins from the 101st song. read_html returns the 1st page for it too.

I think this problem is rooted in the way R treats content containing "#", as that character is associated with comments. Is there a way to make it recognize the correct URLs? A quick fix would be very much appreciated. Thanks.
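The pattern in startIndex can be sketched as follows, assuming every page lists 50 songs as on the pages above. Both helper names (`start_index`, `fragment_param`) are hypothetical, and the code only inspects the URL strings; it does not fetch anything.

```r
# Hypothetical helper: index of the first song on a given page,
# assuming a fixed page size of 50 songs.
start_index <- function(page, page_size = 50) {
  (page - 1) * page_size + 1
}

# Hypothetical helper: read one key=value pair from a URL fragment.
# Assumes the URL contains a "#" followed by "&"-separated pairs.
fragment_param <- function(url, key) {
  fragment <- sub("^[^#]*#", "", url)
  pairs <- strsplit(fragment, "&", fixed = TRUE)[[1]]
  for (pair in pairs) {
    kv <- strsplit(pair, "=", fixed = TRUE)[[1]]
    if (kv[1] == key) {
      return(if (length(kv) > 1) kv[2] else "")
    }
  }
  NA_character_
}

page3 <- paste0(
  "https://www.melon.com/genre/song_list.htm?gnrCode=GN0300",
  "#params%5BgnrCode%5D=GN0300&params%5BdtlGnrCode%5D=&params%5BorderBy%5D=NEW",
  "&params%5BsteadyYn%5D=N&po=pageObj&startIndex=101"
)

start_index(1:3)                      # 1 51 101
fragment_param(page3, "startIndex")   # "101"
```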

m0nhawk
  • 1
    Welcome to StackOverflow. Please read [how to make a reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so we can see your code and try to answer your question. – M-- Nov 20 '18 at 20:32
  • 1
    A "#" is a link to an anchor in a page. With "normal" HTML pages, whatever is after the "#" will only determine where the page scrolls to when you load it. No matter what you put after the "#", your browser will always load the same HTML page. Note this might be different if the page uses a lot of javascript; but if the page uses javascript to load content, you can't use just `read_html`. You need something to execute the javascript, like "RSelenium". `read_html` is probably doing exactly what it's supposed to. Try turning off javascript in your browser and see what happens. – MrFlick Nov 20 '18 at 20:33
  • 1
    Their robots.txt — https://www.melon.com/robots.txt — explicitly restricts your actions. You likely don't but anyone with an ethical bone in their skeleton now knows they should refrain from assisting you. – hrbrmstr Nov 20 '18 at 20:52

0 Answers