I'm using Scrapy to crawl data from a website. This is my code in spider.py, in the spiders folder of my Scrapy project:

import json

import scrapy


class ThumbSpider(scrapy.Spider):
    userInput = readInputData('input/user_input.json')
    name = 'thumb'
    # start_urls = ['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society']

    def __init__(self, *args, **kwargs): 
        super(ThumbSpider, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get('start_urls')

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for cssThumb in self.userInput['cssThumb']: # browse each cssThumb which user provides
            items = response.css('{0}::attr(href)'.format(cssThumb)).getall() # access it

            for item in items:
                item = response.urljoin(item)
                yield scrapy.Request(url=item, callback=self.parse_details)

    def parse_details(self, response):
        data = response.css('div.vnnews-text-post p span::text').extract()

        with open('result/page_content.txt', 'a') as outfile:
            json.dump(data, outfile)

        # Scrapy callbacks must yield items (e.g. dicts) or Requests, not a bare list
        yield {'content': data}
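
A side note on the `parse_details` method above: calling `json.dump` on a file opened in append mode writes one JSON array straight after another, so the resulting file cannot be read back with a single `json.load`. Below is a minimal sketch of one alternative, writing one JSON document per line. The helper names `append_json_line` and `read_json_lines` and the `.jsonl` extension are assumptions for illustration and are not part of the original code.

```python
import json
from pathlib import Path


def append_json_line(path, record):
    """Append one record as a single JSON line, creating the folder if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as outfile:
        # One JSON document per line keeps the file parseable after many appends
        outfile.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Read the file back as a list of records, one per non-empty line."""
    with Path(path).open(encoding="utf-8") as infile:
        return [json.loads(line) for line in infile if line.strip()]
```

Inside the spider, this could be called as `append_json_line('result/page_content.jsonl', data)` in place of the `with open(...)` block.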

I use the ThumbSpider class in main.py and run that file from the terminal:

import json
import os
import modules.misc as msc
from scrapy.crawler import CrawlerProcess
from week_7.spiders.spider import NaviSpider, ThumbSpider

process2 = CrawlerProcess()

process2.crawl(ThumbSpider, start_urls=['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society'])
process2.start()

My program doesn't scrape anything from the 2 URLs. However, when I uncomment `start_urls = ['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society']`, delete the `__init__` and `start_requests` methods from the ThumbSpider class, and in main.py change `process2.crawl(ThumbSpider, start_urls=msc.getUserChoices())` to `process2.crawl(ThumbSpider)`, it works well. I don't know what is happening. Can anyone help me? Thank you so much.

Claire Duong
  • Does it work if you use a parameter name other than `start_urls` to pass them to the spider? – Gallaecio Jun 15 '20 at 10:33
  • It still works if the code is `process2.crawl(ThumbSpider, start_urls=['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society'])`, but when I change it to `process2.crawl(ThumbSpider, start_urls=msc.getUserChoices())` it does not work. `getUserChoices()` reads data from a JSON file and returns a list of URLs. – Claire Duong Jun 15 '20 at 11:18
  • Does `msc.getUserChoices()` return `['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society']`, or something else? – Gallaecio Jun 15 '20 at 11:26
  • Yes, `msc.getUserChoices()` returns `['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society']`. – Claire Duong Jun 15 '20 at 11:29
  • `['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society']` is stored in a JSON file. – Claire Duong Jun 15 '20 at 11:29
  • It does not make sense to me that `['a', 'b']` works but a call that returns `['a', 'b']` does not. I can only assume that function is not returning `['a', 'b']`, with the information given. – Gallaecio Jun 15 '20 at 13:39
  • I am sure that the `getUserChoices()` function returns a list of URLs because I tested it, but I don't know why it does not work when I assign it to `start_urls`. – Claire Duong Jun 15 '20 at 13:42
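
One way to settle the question raised in the comments is to inspect the exact value before handing it to `process2.crawl`. The sketch below is only a diagnostic aid and does not claim to identify the actual cause. The function name `describe_start_urls` is hypothetical, and the checks it performs (a single string instead of a list, non-string items, stray whitespace, a missing scheme) are guesses at common ways a value loaded from JSON can differ from the literal list.

```python
def describe_start_urls(value):
    """Return a list of problems found in a candidate start_urls value."""
    problems = []
    if isinstance(value, str):
        # Iterating over a string yields single characters, not URLs
        problems.append("value is a single string, not a list: %r" % value)
        return problems
    if not isinstance(value, (list, tuple)):
        problems.append("value has type %s, expected list" % type(value).__name__)
        return problems
    if not value:
        problems.append("list is empty")
    for index, url in enumerate(value):
        if not isinstance(url, str):
            problems.append("item %d has type %s" % (index, type(url).__name__))
        elif url != url.strip():
            problems.append("item %d has surrounding whitespace: %r" % (index, url))
        elif not url.startswith(("http://", "https://")):
            problems.append("item %d has no http(s) scheme: %r" % (index, url))
    return problems
```

In main.py, printing `describe_start_urls(msc.getUserChoices())` just before the `process2.crawl(...)` call would show an empty list if the value really matches the hard-coded one.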

0 Answers