I am using the Scrapinghub API and deploying my project with shub. However, the fields of each item come out in the order shown here:

[Image: example item output]

Unfortunately, I need the fields in the following order: Title, Publish Date, Description, Link. How can I get the output in exactly that order for every item?

Below is a short sample of my spider:

import scrapy

from scrapy.spiders import XMLFeedSpider
from tickers.items import tickersItem


class Spider(XMLFeedSpider):
    name = "Scraper"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=ABIO,ACFN,AEMD,AEZS,AITB,AJX,AU,AKERMN,AUPH,AVL,AXPW
                  'https://feeds.finance.yahoo.com/rss/2.0/headline?s=DRIO
                  'https://feeds.finance.yahoo.com/rss/2.0/headline?s=IDXG,IMMU,IMRN,IMUC,INNV,INVT,IPCI,INPX,JAGX,KDMN,KTOV,LQMT
                  )
    itertag = 'item'

    def parse_node(self, response, node):
        item = {}
        item['Title'] = node.xpath('title/text()').extract_first()
        item['Description'] = node.xpath('description/text()').extract_first()
        item['Link'] = node.xpath('link/text()').extract_first()
        item['PublishDate'] = node.xpath('pubDate/text()').extract_first()
        return item

Additionally, here is my items.py file. Its fields are in the same order as in my spider, so I have no idea why the output is not in that order.

items.py:

import scrapy

class tickersItem(scrapy.Item):
    Title = scrapy.Field()
    Description = scrapy.Field()
    Link = scrapy.Field()
    PublishDate = scrapy.Field()

The fields are listed in the same order in both the items file and the spider file, and I have no idea how to fix the output order. I am new to Python.

François Maturel
Friezan

1 Answer

Instead of defining items in items.py, you could use collections.OrderedDict. Just import the collections module and, in the parse_node method, change the line:

item = {}

to the line:

item = collections.OrderedDict()
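
Keep in mind that an OrderedDict remembers the order in which keys are inserted, so the keys also have to be assigned in the order you want them to appear (Title, PublishDate, Description, Link), not in the order used in the question. Below is a minimal sketch of that idea using only the standard library: xml.etree.ElementTree stands in for Scrapy's selectors, and both the sample feed content and the build_item helper are made up for illustration. In the actual spider, the same assignments would go into parse_node with the node.xpath(...) calls from the question.

```python
import collections
import xml.etree.ElementTree as ET

# Made-up RSS <item> used only for illustration; this is not real feed data.
SAMPLE_NODE = """
<item>
  <title>Example headline</title>
  <description>Example description</description>
  <link>https://example.com/story</link>
  <pubDate>Mon, 01 Jan 2018 00:00:00 +0000</pubDate>
</item>
"""


def build_item(node):
    """Hypothetical helper: build an item whose keys keep insertion order."""
    item = collections.OrderedDict()
    # Assign the keys in the order they should appear in the output.
    item['Title'] = node.findtext('title')
    item['PublishDate'] = node.findtext('pubDate')
    item['Description'] = node.findtext('description')
    item['Link'] = node.findtext('link')
    return item


item = build_item(ET.fromstring(SAMPLE_NODE.strip()))
print(list(item.keys()))
```

The printed key list follows the assignment order inside build_item, so reordering the four assignments is all it takes to change the field order of the resulting item.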

Or, if you want defined items, you could use the approach outlined in this answer. Your items.py would then contain this code:

from collections import OrderedDict

from scrapy import Field, Item
import six

class OrderedItem(Item):
    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:  # avoid creating dict for most common case
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v

class tickersItem(OrderedItem):
    Title = Field()
    Description = Field()
    Link = Field()
    PublishDate = Field()

You should then also modify your spider code so that parse_node creates and returns a tickersItem instead of a plain dict. Refer to the documentation for details.

Tomáš Linhart