This might be a long shot, but people have always been really helpful with the questions I've posted in the past so I'm gonna try. If anyone could help me, that would be amazing...
I'm trying to use Scrapy to collect the search results (links) returned after searching for a keyword on a Chinese online newspaper, on pages like this.
When I inspect the HTML for the page in Chrome, the links to the articles are there. But when I try to grab them with a Scrapy spider, the HTML is much more basic and the links I want don't show up. I think this may be because the results are rendered into the page by JavaScript after it loads. I've tried combining Scrapy with 'scrapy-selenium' to get round this, but it is still not working. I have heard Splash might work, but it seems complicated to set up.
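One thing that seems consistent with the JavaScript theory: everything after the `#` in the search URL is a fragment, and fragments are not sent to the server in an HTTP request. Here is a small standard-library sketch that splits the URL from my spider to show what would actually be requested. The variable names are just for illustration.

```python
from urllib.parse import urlsplit, unquote

url = "http://so.news.cn/#search/0/%E4%B8%80%E5%B8%A6%E4%B8%80%E8%B7%AF/1/"
parts = urlsplit(url)

# The part a plain HTTP client requests: the URL with the fragment removed.
requested = parts._replace(fragment="").geturl()

# The fragment stays in the browser; page scripts can read it to run the search.
fragment = unquote(parts.fragment)

print("Requested from server:", requested)
print("Kept client-side:     ", fragment)
```

If that is right, a plain Scrapy request only ever fetches the bare search page, and the keyword and page number are handled entirely by scripts in the browser.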
Here is the code for my Scrapy spider:
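import scrapy
from scrapy_selenium import SeleniumRequest


class QuotesSpider(scrapy.Spider):
    name = "XH"

    def start_requests(self):
        # Search results page for the keyword (URL-encoded in the fragment)
        urls = [
            'http://so.news.cn/#search/0/%E4%B8%80%E5%B8%A6%E4%B8%80%E8%B7%AF/1/'
        ]
        for url in urls:
            # Load the page through Selenium instead of Scrapy's downloader
            yield SeleniumRequest(url=url, wait_time=90, callback=self.parse)

    def parse(self, response):
        # Title as seen by the Selenium driver
        print(response.request.meta['driver'].title)
        # Second-to-last URL segment is the results page number
        page = response.url.split("/")[-2]
        filename = 'XH-%s.html' % page
        # Save the HTML that Scrapy received so it can be inspected
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)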
I can also post any of the other Scrapy files, if that is helpful. I have also modified settings.py, following these instructions.
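For context, the settings.py changes that the scrapy-selenium README describes look roughly like the sketch below. The choice of Firefox with geckodriver in headless mode is an assumption for illustration, and this is not necessarily a copy of my actual file.

```python
from shutil import which

# Assumed driver choice: Firefox via geckodriver (other drivers are possible).
SELENIUM_DRIVER_NAME = 'firefox'
# which() returns None if geckodriver is not on PATH.
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
# Run the browser without opening a window.
SELENIUM_DRIVER_ARGUMENTS = ['-headless']

# Route SeleniumRequest objects through the scrapy-selenium middleware.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
```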
Any help would be really appreciated. I'm completely stuck with this!