
This might be a long shot, but people have always been really helpful with the questions I've posted in the past so I'm gonna try. If anyone could help me, that would be amazing...

I'm trying to use Scrapy to get search results (links) after searching for a keyword on a Chinese online newspaper - pages like the search URL in the spider code below.

When I inspect the HTML for the page in Chrome, the links to the articles seem to be there. But when I try to grab the page using a Scrapy spider, the HTML is much more basic and the links I want don't show up. I think this may be because the results are being drawn onto the page using JavaScript? I've tried combining Scrapy with 'scrapy-selenium' to get round this, but it's still not working. I've heard Splash might work, but it seems complicated to set up.

Here is the code for my Scrapy spider:

import scrapy
from scrapy_selenium import SeleniumRequest


class QuotesSpider(scrapy.Spider):
    name = "XH"

    def start_requests(self):
        urls = [
            'http://so.news.cn/#search/0/%E4%B8%80%E5%B8%A6%E4%B8%80%E8%B7%AF/1/'
        ]
        for url in urls:
            # Render the page in a real browser via scrapy-selenium, waiting for the JS content
            yield SeleniumRequest(url=url, wait_time=90, callback=self.parse)

    def parse(self, response):
        # Print the rendered page title, then save the body to check what HTML came back
        print(response.request.meta['driver'].title)
        page = response.url.split("/")[-2]
        filename = 'XH-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

I can also post any of the other Scrapy files if that would be helpful. I have also modified settings.py, following these instructions - roughly as sketched below.
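
For reference, a minimal sketch of my settings.py additions, based on the scrapy-selenium README (the driver name and executable path are assumptions - adjust them for your own setup):

from shutil import which

# scrapy-selenium driver configuration (setting names from the library's README)
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

# Enable the Selenium downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}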

Any help would be really appreciated. I'm completely stuck with this!

  • Please double-check the code you pasted. I think you did a bad copy/paste. – sen4ik Dec 09 '19 at 22:31
  • Thanks for your reply. I've corrected the input of the code. – Nick Olczak Dec 10 '19 at 07:39
  • https://docs.scrapy.org/en/latest/topics/dynamic-content.html – Gallaecio Dec 10 '19 at 08:04
  • @Gallaecio - thanks for commenting. I've read through this and it seems to point to using Splash (through Docker). Is that the only way? Is it not possible to do this through Selenium as I've been trying to do...? Thanks for any help. – Nick Olczak Dec 10 '19 at 09:17
  • You can use a downloader middleware to override the content-extraction process; see [this](https://stackoverflow.com/a/31186730/1578952). You can use [this](https://github.com/clemfromspace/scrapy-selenium) library to do such a job. – thirdDeveloper Dec 10 '19 at 09:28
  • @nolczak The last section covers Selenium. But the point is that you should try to understand how the site works and try regular Scrapy before you fall back to Splash or Selenium. – Gallaecio Dec 10 '19 at 10:29

1 Answer


In the browser's inspect tool, open the Network tab and watch the requests; you will find that the data comes from the getNews URL used below, so crawl that instead with a normal scrapy.Request().
The spider would look like this:

import scrapy
import json


class QuotesSpider(scrapy.Spider):
    name = "XH"

    def start_requests(self):
        # The JSON endpoint the search page calls in the background (visible in the Network tab)
        urls = [
            'http://so.news.cn/getNews?keyword=%E4%B8%80%E5%B8%A6&curPage=1&sortField=0&searchFields=1&lang=cn'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The endpoint returns JSON rather than HTML, so decode it and yield the article links
        json_data = json.loads(response.text)
        for data in json_data['content']['results']:
            yield {
                'url': data['url']
            }
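
If you need more than the first page of results, the curPage parameter in the URL suggests the endpoint is paginated. Here is a minimal sketch of following it; note that XH_paged is a hypothetical spider name, and the 'curPage' and 'pageCount' fields in the JSON are assumptions on my part, so inspect the actual response to confirm the field names:

import scrapy
import json


class XHPagedSpider(scrapy.Spider):
    # Hypothetical variant of the spider above that walks through the result pages
    name = "XH_paged"
    base_url = 'http://so.news.cn/getNews?keyword=%E4%B8%80%E5%B8%A6&curPage={page}&sortField=0&searchFields=1&lang=cn'

    def start_requests(self):
        yield scrapy.Request(self.base_url.format(page=1), callback=self.parse)

    def parse(self, response):
        content = json.loads(response.text)['content']
        for result in content['results']:
            yield {'url': result['url']}
        # 'curPage' and 'pageCount' are assumed field names - check the real JSON before relying on them
        cur = int(content.get('curPage', 1))
        total = int(content.get('pageCount', 1))
        if cur < total:
            yield scrapy.Request(self.base_url.format(page=cur + 1), callback=self.parse)

You can then run either spider with scrapy crawl <name> -o results.json to export the collected links (and don't forget the import json line - see the comments below).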
  • Thank you very much for this answer - it works well. Is there any way to automatically find the URL the data is coming from (so I don't have to inspect each page)? Also, could you tell me how to parse the crawl of that page to get the links out of it - is it in Java? Thanks again! – Nick Olczak Dec 10 '19 at 15:48
  • It is the best way I know, so let me know if you find something better. I made some updates and added a parse method; check it out. – Moein Kameli Dec 10 '19 at 20:34
  • Thank you again for your help; I really appreciate it. I put the parse method in and have been running 'scrapy crawl XH -o results.json' from the terminal. However, the 'results.json' file produced is blank - any ideas why? – Nick Olczak Dec 10 '19 at 21:51
  • I just checked it and it works fine for me; you may want to double-check whether something in yours is off. Did you change the URL, for instance? Does the log look okay? – Moein Kameli Dec 10 '19 at 22:13
  • I checked the URL etc. and all seems okay. When changing the code above, I just removed the connection with Selenium, changed the URL to the one for the data, and then replaced the code below 'def parse(self, response):' with the code you sent me for parsing. Is there a way you can send me the complete code for the spider you made it work with? Thank you! – Nick Olczak Dec 10 '19 at 22:55
  • I posted the whole spider; it is working for me. I also found out the original search result doesn't exist anymore, and since I know nothing about Chinese I changed the search keyword blindly, but it shows some results. – Moein Kameli Dec 11 '19 at 13:17
  • Sorry, I was at a meeting yesterday so couldn't look until now. I had not added the 'import json' line. That works great now! Thank you so much for your help! I really appreciate it. – Nick Olczak Dec 12 '19 at 11:48