I am trying to scrape the comment section content of this link:
https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya
However, the comments are loaded dynamically with JavaScript through an XHR request. I have pinpointed the request with Chrome DevTools:
https://newcomment.detik.com/graphql?query={ search(type: "comment",size: 10 ,page:1,sort:"newest", adsLabelKanal: "cnn_nasional", adsEnv: "desktop", query: [{name: "news.artikel", terms: "510762" } , {name: "news.site", terms: "cnn"} ]) { paging sorting counter counterparent profile hits { posisi hasAds results { id author content like prokontra status news create_date pilihanredaksi refer liker { id } reporter { id status_report } child { id child parent author content like prokontra status create_date pilihanredaksi refer liker { id } reporter { id status_report } authorRefer } } } } }
Sorry, it's bloated. I have also found that the key to getting the comment section of a specific article is this part of the query string:
terms: "510762"
Unfortunately, I have not found a way to scrape the required "terms" parameter from the page so that I can simulate the request for many different pages.
That is why I am opting for Scrapyjs & Splash. I have followed the accepted solution at this link: How can Scrapy deal with Javascript
However, the response that I get from the Scrapy SplashRequest still does not contain the JavaScript-loaded content (the comment section). I have set up settings.py, started Splash in a Docker container as instructed, and modified my Scrapy spider to yield requests this way:
    yield scrapy.Request(url, self.parse, meta={
        'splash': {
            'endpoint': 'render.html',
            'args': {'wait': 0.5}
        }
    })
Is there a step that I'm missing, or should I just give up and use Selenium for this? Thank you in advance.