
I have this scrapy spider that runs well:

# -*- coding: utf-8 -*-
import scrapy


class AllCategoriesSpider(scrapy.Spider):
    name = 'vieles'
    allowed_domains = ['examplewiki.de']
    start_urls = [
        'http://www.exampleregelwiki.de/index.php/categoryA.html',
        'http://www.exampleregelwiki.de/index.php/categoryB.html',
        'http://www.exampleregelwiki.de/index.php/categoryC.html',
    ]

    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to the subpages
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.last.block').extract(),
        }

With `scrapy runspider spider.py -o dat.json` it saves all the info to dat.json.

I would like to have one output file per start URL: categoryA.json, categoryB.json and so on.

A similar question has been left unanswered; I cannot reproduce that answer and I am not able to learn from the suggestions there.

How do I achieve the goal of having several output files, one per start URL? I would like to run only one command/shell script/file to achieve this.

Nivatius

1 Answer


You didn't use real URLs in your code, so I used my own page for the test.
I had to change the CSS selectors and I used different fields.

I save it as CSV because it is easier to append data.
With JSON you would need to read all items from the file, add the new item and save all items again to the same file.
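Just to illustrate that point, appending to a JSON file would mean a read-modify-write cycle roughly like this (the helper name and filename handling are only for illustration):

import json
import os

def append_item_to_json(filename, item):
    # read everything that is already in the file...
    items = []
    if os.path.exists(filename):
        with open(filename) as f:
            items = json.load(f)
    # ...add the new item...
    items.append(item)
    # ...and write the whole list back again
    with open(filename, 'w') as f:
        json.dump(items, f)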


I create an extra field Category to use it later as the filename in the pipeline.

items.py

import scrapy

class CategoryItem(scrapy.Item):
    Title = scrapy.Field()
    Date = scrapy.Field()
    # extra field, used later as the filename in the pipeline
    Category = scrapy.Field()

In the spider I get the category from the URL and send it to parse_details using meta in the Request.
In parse_details I add the category to the item.

spiders/example.py

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html','http://blog.furas.pl/category/html.html','http://blog.furas.pl/category/linux.html']

    def parse(self, response):

        # get category from url
        category = response.url.split('/')[-1][:-5]

        urls = response.css('article a::attr(href)').extract()  # links to the subpages

        for url in urls:
            # skip some urls
            if ('/tag/' not in url) and ('/category/' not in url):
                url = response.urljoin(url)
                # add category (as meta) to send it to callback function
                yield scrapy.Request(url=url, callback=self.parse_details, meta={'category': category})

    def parse_details(self, response):

        # get category
        category = response.meta['category']

        # get only first title (or empty string '') and strip it
        title = response.css('h1.entry-title a::text').extract_first('')
        title = title.strip()

        # get only first date (or empty string '') and strip it
        date = response.css('.published::text').extract_first('')
        date = date.strip()

        yield {
            'Title': title,
            'Date': date,
            'Category': category,
        }

In the pipeline I get the category, use it to open a file for appending, and save the item.

pipelines.py

import csv

class CategoryPipeline(object):

    def process_item(self, item, spider):

        # get category and use it as filename
        filename = item['Category'] + '.csv'

        # open file for appending (newline='' is recommended for the csv module on Python 3)
        with open(filename, 'a', newline='') as f:
            writer = csv.writer(f)

            # write only selected elements 
            row = [item['Title'], item['Date']]
            writer.writerow(row)

            # write all data in the row
            # warning: item is a dictionary, so item.values() may not return the values in the same order every time
            #writer.writerow(item.values())

        return item

In settings.py I have to uncomment ITEM_PIPELINES to activate the pipeline.

settings.py

ITEM_PIPELINES = {
    'category.pipelines.CategoryPipeline': 300,
}

Full code on GitHub: python-examples/scrapy/save-categories-in-separated-files


BTW: I think you could also write to the files directly in parse_details.
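For example, a minimal sketch of that variant (same page, selectors and category-from-URL trick as above, so it needs no items.py, pipeline or settings changes and can be run with scrapy runspider):

import csv

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html',
                  'http://blog.furas.pl/category/html.html',
                  'http://blog.furas.pl/category/linux.html']

    def parse(self, response):
        # category taken from the url, as above
        category = response.url.split('/')[-1][:-5]
        for url in response.css('article a::attr(href)').extract():
            if ('/tag/' not in url) and ('/category/' not in url):
                yield scrapy.Request(response.urljoin(url),
                                     callback=self.parse_details,
                                     meta={'category': category})

    def parse_details(self, response):
        title = response.css('h1.entry-title a::text').extract_first('').strip()
        date = response.css('.published::text').extract_first('').strip()
        # append the row straight to <category>.csv - no Item or pipeline needed
        with open(response.meta['category'] + '.csv', 'a') as f:
            csv.writer(f).writerow([title, date])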

furas
  • I have never worked with projects, only spiders, so your answer is a little over my head. How does one use parse_details? – Nivatius Nov 19 '17 at 15:24
  • by default `Request` gets the data from a page and executes `parse()`, but in your code you execute `Request` to get the subpage and execute `parse_details()` (`scrapy.Request(..., callback=self.parse_details, ...)`). If you don't work with a project, then you could use `with open(...) ...` directly in `parse_details` and you won't need `CategoryPipeline` – furas Nov 19 '17 at 15:29
  • do you mean I could write the file right in the parse_details function? – Nivatius Nov 19 '17 at 15:58
  • yes, you can write files in the `parse_details` function - it is your code, nobody can stop you ;) – furas Nov 19 '17 at 23:16