
I have created a spider to scrape problems from projecteuler.net. I concluded my answer to a related question with the following:

I launch this with the command scrapy crawl euler -o euler.json and it outputs an array of unordered JSON objects, each corresponding to a single problem. This is fine for me, because I'm going to process it with JavaScript, even though I think resolving the ordering problem via Scrapy could be very simple.

But unfortunately, ordering the items that Scrapy writes to JSON (I need ascending order by the id field) does not seem to be so simple. I've studied every single component (middlewares, pipelines, exporters, signals, etc.), but none of them seems useful for this purpose. I've arrived at the conclusion that a solution to this problem doesn't exist in Scrapy at all (except, maybe, via a very elaborate trick), and that you are forced to order things in a second phase. Do you agree, or do you have some ideas? I copy the code of my scraper here.

Spider:

# -*- coding: utf-8 -*-
import scrapy
from eulerscraper.items import Problem
from scrapy.loader import ItemLoader


class EulerSpider(scrapy.Spider):
    name = 'euler'
    allowed_domains = ['projecteuler.net']
    start_urls = ["https://projecteuler.net/archives"]

    def parse(self, response):
        # Read the highest page number from the pagination links.
        numpag = response.css("div.pagination a[href]::text").extract()
        maxpag = int(numpag[-1])

        # Follow every problem link on the first archive page.
        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

        # Queue the remaining archive pages.
        for i in range(2, maxpag + 1):
            next_page = "https://projecteuler.net/archives;page=" + str(i)
            yield response.follow(next_page, self.parse_next)

    def parse_next(self, response):
        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

    def parse_problems(self, response):
        # Populate a Problem item from a single problem page.
        loader = ItemLoader(item=Problem(), response=response)
        loader.add_css("title", "h2")
        loader.add_css("id", "#problem_info")
        loader.add_css("content", ".problem_content")

        yield loader.load_item()

Item:

import re

import scrapy
from scrapy.loader.processors import MapCompose, Compose
from w3lib.html import remove_tags


def extract_first_number(text):
    # Return the first integer appearing in the given text.
    match = re.search(r'\d+', text)
    return int(match.group())


def array_to_value(element):
    return element[0]


class Problem(scrapy.Item):
    id = scrapy.Field(
        input_processor=MapCompose(remove_tags, extract_first_number),
        output_processor=Compose(array_to_value)
    )
    title = scrapy.Field(input_processor=MapCompose(remove_tags))
    content = scrapy.Field()
Lore
  • How about exporting your unordered json to an [OrderedDict](https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/ch01s07.html) in python and sorting that? – BoboDarph Feb 16 '18 at 13:23
  • You can't order a JSON; the RFC clearly states that they are collections of unordered objects. You can't order the representation of a dict object in python either, unless you order the list of keys and parse them in the order of the list. You can get the keys in the same order they are read when you read them in a json object in python, but that helps nothing with the sorting by key value. My conclusion would be that you need to import your results into a different, ordered data type (OrderedDict) and do your sorting there, then do whatever you need to do with the sorted data. – BoboDarph Feb 16 '18 at 13:30
  • Pandas dataframes also handle a lot of usual data sorting operations; it might be worth it to look into those too. I don't have a definite answer to give; you have to try out different implementations and see which kind of data structure suits your problem best. – BoboDarph Feb 16 '18 at 13:32
  • @BoboDarph: https://stackoverflow.com/questions/7214293/is-the-order-of-elements-in-a-json-list-preserved Yes, the order of elements in JSON arrays is preserved. From RFC 7159 -The JavaScript Object Notation (JSON) Data Interchange Format (emphasis mine): An object is an unordered collection of zero or more name/value pairs, where a name is a string and a value is a string, number, boolean, null, object, or array. An array is an ordered sequence of zero or more values. The terms "object" and "array" come from the conventions of JavaScript. – Lore Feb 20 '18 at 08:54
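
Following up on the discussion above: since JSON arrays do preserve their order, one option is simply to sort the output in a second phase, after the crawl has finished. A minimal sketch of such a post-processing step (assuming the euler.json file produced by the command above, with a numeric id on every object):

import json

# Load the unordered output of `scrapy crawl euler -o euler.json`,
# sort the array by the id field, and write it back.
with open('euler.json') as f:
    problems = json.load(f)

problems.sort(key=lambda p: p['id'])

with open('euler.json', 'w') as f:
    json.dump(problems, f, indent=2)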

2 Answers


If I needed my output file to be sorted (I will assume you have a valid reason to want this), I'd probably write a custom exporter.

This is how Scrapy's built-in JsonItemExporter is implemented.
With a few simple changes, you can modify it to add the items to a list in export_item(), and then sort the items and write out the file in finish_exporting().

Since you're only scraping a few hundred items, the downsides of storing them in a list and not writing to the file until the crawl is done shouldn't be a problem for you.
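
A rough, untested sketch of that idea (SortedJsonItemExporter and the eulerscraper.exporters module path are placeholder names; it assumes every item carries a numeric id):

from scrapy.exporters import JsonItemExporter


class SortedJsonItemExporter(JsonItemExporter):
    """Buffer items during the crawl; sort and write them at the end."""

    def start_exporting(self):
        # Don't write anything yet; just prepare the buffer.
        self.buffered_items = []

    def export_item(self, item):
        self.buffered_items.append(item)

    def finish_exporting(self):
        # Reuse the parent's logic for brackets, commas and JSON encoding.
        super().start_exporting()
        for item in sorted(self.buffered_items, key=lambda i: i['id']):
            super().export_item(item)
        super().finish_exporting()

You can then register it for the json feed format in settings.py via FEED_EXPORTERS:

FEED_EXPORTERS = {
    'json': 'eulerscraper.exporters.SortedJsonItemExporter',
}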

stranac
  • I'm going to create my own exporter instead of overwriting it: https://stackoverflow.com/questions/33290876/how-to-create-custom-scrapy-item-exporter – Lore Feb 16 '18 at 15:13
  • Of course, editing a third-party lib's source files is rarely a good idea. Sorry if my choice of words was confusing. – stranac Feb 16 '18 at 17:47

For now, I've found a working solution using a pipeline:

import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.list_items = []
        self.file = open('euler.json', 'w')

    def close_spider(self, spider):
        # Place each item at the index given by its id (ids start at 1).
        ordered_list = [None] * len(self.list_items)

        for item in self.list_items:
            ordered_list[int(item['id']) - 1] = json.dumps(dict(item))

        # Join with commas so the last item is not followed by a trailing
        # comma, which would make the output invalid JSON.
        self.file.write("[\n")
        self.file.write(",\n".join(ordered_list))
        self.file.write("\n]\n")
        self.file.close()

    def process_item(self, item, spider):
        self.list_items.append(item)
        return item
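
For the pipeline to actually run, it has to be enabled in settings.py (the eulerscraper.pipelines module path is a guess based on the project name; adjust it to wherever the class lives):

ITEM_PIPELINES = {
    'eulerscraper.pipelines.JsonWriterPipeline': 300,
}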

This may not be optimal, though, because the documentation suggests in another example:

The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.

Lore