
I am trying to scrape the comment-section content of this article: https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya

However, it is loaded dynamically with JavaScript via an XHR request. I have pinpointed the request with Chrome DevTools:

https://newcomment.detik.com/graphql?query={ search(type: "comment",size: 10 ,page:1,sort:"newest", adsLabelKanal: "cnn_nasional", adsEnv: "desktop", query: [{name: "news.artikel", terms: "510762" } , {name: "news.site", terms: "cnn"} ]) { paging sorting counter counterparent profile hits { posisi hasAds results { id author content like prokontra status news create_date pilihanredaksi refer liker { id } reporter { id status_report } child { id child parent author content like prokontra status create_date pilihanredaksi refer liker { id } reporter { id status_report } authorRefer } } } } }

Sorry for the bloat, but I have also found that the key to getting the comment section of a specific article on every request is this query-string parameter:

terms: "510762"

Unfortunately, I have not found a way to extract the required "terms" parameter from the page so that I can simulate the request for many different pages.

That is why I am opting for Scrapy and Splash. I have followed the accepted solution at this link: How can Scrapy deal with Javascript

However, the response that I get from the Scrapy SplashRequest still does not contain the JavaScript-loaded content (the comment section)! I have set up settings.py, run Splash in a Docker container as instructed, and modified my Scrapy spider to yield this way:

    yield scrapy.Request(url, self.parse, meta={
        'splash': {
            'endpoint': 'render.html',
            'args': {'wait': 0.5}
        }
    })

Is there some step that I'm missing or should I just give up and use Selenium for this? Thank you in advance.


2 Answers


You can get the article id by parsing the URL directly:

import re

url = "https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya"
articleid = re.search(r'(\d+)-(\d+)-(\d+)', url).group(3)
print(f"request for article {articleid}")

Note that the last number group is the article id, here 510762.
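If you prefer to avoid regex, the same id can be pulled out with plain string operations. A minimal sketch, assuming the URL always follows the `date-channel-id/slug` path pattern shown above:

```python
from urllib.parse import urlparse

url = ("https://www.cnnindonesia.com/nasional/"
       "20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya")

# The second path segment looks like "20200607164937-20-510762";
# the article id is its last hyphen-separated token.
segment = urlparse(url).path.strip("/").split("/")[1]
articleid = segment.rsplit("-", 1)[-1]
print(articleid)  # 510762
```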

You can also get it from the meta tag named articleid:

from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya")
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find("meta", {"name":"articleid"})["content"])

If you go with the first solution, you don't need any page scraping at all once you know the URL. Here is an example that fetches the comments:

import requests
import re

url = "https://www.cnnindonesia.com/nasional/20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya"

articleid = re.search(r'(\d+)-(\d+)-(\d+)', url).group(3)
print(f"request for article {articleid}")

query = """
{ 
  search(type: "comment",size: 10 ,page:1,sort:"newest", adsLabelKanal: "cnn_nasional", adsEnv: "desktop", query: [{name: "news.artikel", terms: "%s" } , {name: "news.site", terms: "cnn"} ]) { 
    paging 
    sorting 
    counter 
    counterparent 
    profile 
    hits { 
      posisi 
      hasAds 
      results { 
        id 
        author 
        content 
        like 
        prokontra 
        status 
        news 
        create_date 
        pilihanredaksi 
        refer 
        liker { 
          id 
        } 
        reporter { 
          id 
          status_report 
        } 
        child { 
          id 
          child 
          parent 
          author 
          content 
          like 
          prokontra 
          status 
          create_date 
          pilihanredaksi 
          refer 
          liker { 
            id 
          } 
          reporter { 
            id 
            status_report 
          } 
          authorRefer  
        }  
      }  
    }  
  }  
}""" % articleid

r = requests.get("https://newcomment.detik.com/graphql",
    params = {
        "query": query
    })

results = r.json()

print(results["data"]["search"]["hits"]["results"])
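Since `page` and `size` are just ordinary parameters inside the query string, you can wrap the query construction in a helper and walk through pages in a loop. A sketch of the query-building part only; `build_comment_query` is a helper name of my own, and the field list is trimmed here to keep the example short (use the full list from above in practice):

```python
def build_comment_query(article_id: str, page: int = 1, size: int = 10) -> str:
    """Build the GraphQL query string for one page of comments."""
    return (
        '{ search(type: "comment", size: %d, page: %d, sort: "newest", '
        'adsLabelKanal: "cnn_nasional", adsEnv: "desktop", '
        'query: [{name: "news.artikel", terms: "%s"}, '
        '{name: "news.site", terms: "cnn"}]) '
        '{ paging counter hits { results { id author content create_date } } } }'
        % (size, page, article_id)
    )

# Each page would then be fetched with:
# requests.get("https://newcomment.detik.com/graphql",
#              params={"query": build_comment_query("510762", page=p)})
print(build_comment_query("510762", page=2))
```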
  • You, my good sir, is a genius. I am so ashamed that I didn't look into the url itself and the meta tag! The ID was there all along XD. Thank you so much. – RawrDamn Jun 13 '20 at 04:47

The term 510762 is in the URL of the page, so you can extract it with a regex in Python. You can then use the GraphQL API to fetch the comments: the response is JSON, which is much easier to collect.
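A minimal sketch of that idea, stopping just before the network call so the id extraction and URL construction stay visible. The shortened query string here is an assumption for illustration; a real request needs the full field list from the question:

```python
import re
from urllib.parse import urlencode

url = ("https://www.cnnindonesia.com/nasional/"
       "20200607164937-20-510762/risma-usul-ke-khofifah-agar-tak-perpanjang-psbb-surabaya")

# The article id is the third number group in the URL path.
article_id = re.search(r"(\d+)-(\d+)-(\d+)", url).group(3)

# Illustrative, shortened query; the real one needs the full field selection.
query = ('{ search(type: "comment", query: [{name: "news.artikel", terms: "%s"}]) '
         '{ hits { results { author content } } } }' % article_id)

# requests.get(api_url) (or params={"query": query}) would then return JSON.
api_url = "https://newcomment.detik.com/graphql?" + urlencode({"query": query})
print(article_id)  # 510762
```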