I am scraping a music streaming website where new songs are updated and indexed. Only the 1st page loads successfully with the read_html function. It doesn't work for the 2nd page onward; instead, the function returns the 1st page again.
It turns out this results from the structure of the URLs.
The URL of the 1st page (displaying 50 songs) is:
https://www.melon.com/genre/song_list.htm?gnrCode=GN0300
And the URL of the 2nd page (displaying the 51st to 100th songs) just appends a string to the first one, starting with #:
https://www.melon.com/genre/song_list.htm?gnrCode=GN0300#params%5BgnrCode%5D=GN0300&params%5BdtlGnrCode%5D=&params%5BorderBy%5D=NEW&params%5BsteadyYn%5D=N&po=pageObj&startIndex=51
read_html doesn't seem to take the part beginning from '#', so it basically operates as if I had passed in the URL of the first page again.
The URL of the 3rd page differs only in "startIndex=101", as it begins from the 101st song. read_html returns the 1st page for it too.
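To illustrate what I mean, here is a minimal base R sketch (no network access needed; the variable names are just mine for illustration) that splits the 2nd page URL at the '#'. The part before the '#' comes out identical to the 1st page URL, which matches what read_html appears to be fetching.

```r
# The two URLs from this post; page2 is page1 plus the part starting with '#'
page1 <- "https://www.melon.com/genre/song_list.htm?gnrCode=GN0300"
page2 <- paste0(
  page1,
  "#params%5BgnrCode%5D=GN0300&params%5BdtlGnrCode%5D=",
  "&params%5BorderBy%5D=NEW&params%5BsteadyYn%5D=N",
  "&po=pageObj&startIndex=51"
)

# Drop everything from the first '#' onward
before_hash <- sub("#.*$", "", page2)

# Keep only what comes after the first '#'
after_hash <- sub("^[^#]*#", "", page2)

# Compare the part before '#' with the 1st page URL
print(identical(before_hash, page1))

# The paging information (startIndex) lives entirely after the '#'
print(after_hash)
```

So whatever distinguishes page 2 from page 1 sits entirely in the part after the '#'.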
I think this problem is rooted in the way R treats content containing "#", since that character is used for comments. Is there another way to make it recognize the correct URLs? A quick fix would be very much appreciated. Thanks.