I am writing my own Scrapy Item Pipeline to write each item to S3 as an individual JSON file. This is my code so far, but I can't work out how to serialize each item to JSON.

NOTE: This is a question about how to serialize a scrapy.Item object, not a general question about how to serialize an object.

def process_item(self, item, spider):
  s3_conn = boto.connect_s3(spider.settings.get('AWS_ACCESS_KEY_ID'), spider.settings.get('AWS_SECRET_ACCESS_KEY'))
  bucket = s3_conn.get_bucket(spider.settings.get('AWS_S3_BUCKET'))

  url_path = item['path']

  key = boto.s3.key.Key(bucket, "crawls/" + base64.b64encode(url_path) + ".json")

  serialized = json.dumps(item)
  key.set_contents_from_string(serialized)
  return item

However, the above code gives me:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 651, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/cetinick/Projects/cmlsocialbot/lib/spider/spider/pipelines.py", line 23, in process_item
    serialized = json.dumps(item)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 244, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: {'description': None,
 'h1s': [u'Example Domain'],
 'h2s': [],
 'h3s': [],
 'h4s': [],
 'h5s': [],
 'images': [],
 'inbound_links': [],
 'keywords': [(u'domain', 2),
              (u'examples', 2),
              (u'established', 1),
              (u'documents', 1),
              (u'permission', 1),
              (u'prior', 1),
              (u'coordination', 1),
              (u'illustrative', 1)],
 'keywords_count': 10,
 'outbound_links': [{'nofollow': False,
 'text': 'More information...',
 'url': 'http://www.iana.org/domains/example'}],
 'path': '',
 'title': u'Example Domain',
 'url': 'http://example.com',
 'words_count': 28} is not JSON serializable

items.py

class ItemLink(scrapy.Item):
    url = scrapy.Field()
    text = scrapy.Field()
    nofollow = scrapy.Field()

class ItemImage(scrapy.Item):
    src = scrapy.Field()
    alt = scrapy.Field()
    title = scrapy.Field()

class SpiderPage(scrapy.Item):
    url = scrapy.Field()
    path = scrapy.Field()

    title = scrapy.Field()
    description = scrapy.Field()

    h1s = scrapy.Field()
    h2s = scrapy.Field()
    h3s = scrapy.Field()
    h4s = scrapy.Field()
    h5s = scrapy.Field()

    keywords_count = scrapy.Field()
    words_count = scrapy.Field()

    keywords = scrapy.Field()

    outbound_links = scrapy.Field(serializer=ItemLink)
    inbound_links = scrapy.Field(serializer=ItemLink)

    images = scrapy.Field(serializer=ItemImage)
Andrew Cetinic
  • What does `item` look like? – e4c5 Jan 01 '17 at 07:57
  • Hi, added my items.py file – Andrew Cetinic Jan 01 '17 at 08:03
  • Can you add an example of calling `process_item(self, item, spider)`, and the _content_ of something that would be fed in via `item`? There's just not enough context here to answer the Question. Also, the full stacktrace in which `raise TypeError(repr(o) + " is not JSON serializable")` appears? Also, what happens if you use `json.loads` in place of `json.dumps`? – Chris Larson Jan 01 '17 at 08:46
  • Hi, added the full stack trace; the object it is trying to serialize is shown above. – Andrew Cetinic Jan 01 '17 at 08:54
  • Possible duplicate of [How to make a class JSON serializable](http://stackoverflow.com/questions/3768895/how-to-make-a-class-json-serializable). Generally speaking, the simplest way is to just use ordinary types that *are* JSON serializable out of the box, e.g., `dict`, `str`, `list`, etc. – jpmc26 Jan 01 '17 at 08:59
  • Hi @jpmc26, thanks for the tip, however this is not a duplicate. It is a Scrapy specific question on how I can serialize a scrapy.Item object. – Andrew Cetinic Jan 01 '17 at 09:16
  • Just because a question involves a certain technology doesn't make it "not a duplicate." In this case, the library happened to provide an intermediary that allowed you to use standard Python serialization modules (and I'm glad you found it), but if it didn't have one, your answer would be in the question I suggested. (Whether it's a duplicate depends largely on the *answer*, not just on the question.) – jpmc26 Jan 01 '17 at 10:59
  • Not going to even entertain this. – Andrew Cetinic Jan 01 '17 at 11:06

1 Answer

For those wanting an ItemPipeline that exports to S3, this is the working code I came up with. It writes each item to S3 as its own JSON file.

import base64
import json

import boto
from boto.s3.key import Key
from scrapy.exporters import PythonItemExporter

class JsonWriterPipeline(object):
    def _get_exporter(self, **kwargs):
        # binary=False keeps string values as text instead of encoding them to bytes.
        return PythonItemExporter(binary=False, **kwargs)

    def process_item(self, item, spider):
        # Connect to S3 using the credentials from the Scrapy settings.
        s3_conn = boto.connect_s3(spider.settings.get('AWS_ACCESS_KEY_ID'), spider.settings.get('AWS_SECRET_ACCESS_KEY'))
        bucket = s3_conn.get_bucket(spider.settings.get('AWS_S3_BUCKET'))

        # An empty path is treated as the site root.
        url_path = item['path']
        if url_path == "":
            url_path = "/"

        # Convert the item into plain Python types that json.dumps accepts.
        ie = self._get_exporter()
        exported = ie.export_item(item)

        # Store the JSON under a key derived from the base64-encoded path.
        key = Key(bucket, "crawls/" + spider.site_id + base64.b64encode(url_path) + ".json")
        key.set_contents_from_string(json.dumps(exported))
        return item
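
A note on the key name, for anyone running this on Python 3: there `base64.b64encode` expects bytes and returns bytes, so the string concatenation above would fail. The standard base64 alphabet also contains "/" and "+", which can end up in the key name. Below is a minimal sketch of one way to build the key name under those assumptions. The function name `s3_key_name` and the `prefix` parameter are hypothetical, not part of boto or Scrapy, and the URL-safe alphabet is my substitution, not what the pipeline above uses.

```python
import base64


def s3_key_name(site_id, url_path, prefix="crawls/"):
    """Build a key name shaped like the one in the pipeline (hypothetical helper)."""
    # Mirror the pipeline: treat an empty path as the site root.
    if url_path == "":
        url_path = "/"
    # On Python 3, b64encode works on bytes, so encode first and decode after.
    # The URL-safe alphabet uses "-" and "_" in place of "+" and "/".
    encoded = base64.urlsafe_b64encode(url_path.encode("utf-8")).decode("ascii")
    return prefix + site_id + encoded + ".json"


key_name = s3_key_name("site1", "")
```

The result can be passed to `Key(bucket, key_name)` in place of the inline concatenation. Keys written with the URL-safe alphabet will differ from keys written with the original code for some paths, so mixing the two in one bucket is probably best avoided.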
Andrew Cetinic
  • Note that this first serializes the object to standard Python types like `dict`, `list`, etc. and then uses the standard Python `json` module to produce the desired JSON. (The answer would benefit from some explanation about how it works, including that detail.) – jpmc26 Jan 01 '17 at 10:52
  • This didn't solve my problem. My exported item is still nested inside a `_values` object. – kenecaswell Oct 24 '19 at 20:45
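
To illustrate the two-step approach described in the comments (convert to plain Python types first, then call `json.dumps`), here is a minimal sketch that uses only the standard library. `FakeItem` is a hypothetical stand-in for a `scrapy.Item`, assuming the item behaves like a mapping without being a `dict`; `to_plain` is likewise a made-up helper, not a Scrapy function. It is written for Python 3.

```python
import json
from collections.abc import Mapping


class FakeItem(Mapping):
    """Hypothetical stand-in for a scrapy.Item: dict-like, but not a dict."""

    def __init__(self, **fields):
        self._values = dict(fields)

    def __getitem__(self, name):
        return self._values[name]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)


def to_plain(obj):
    """Recursively convert mappings to dicts and lists/tuples to lists."""
    if isinstance(obj, Mapping):
        return {key: to_plain(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(value) for value in obj]
    return obj


link = FakeItem(url="http://www.iana.org/domains/example",
                text="More information...", nofollow=False)
page = FakeItem(url="http://example.com", path="",
                outbound_links=[link], keywords=[("domain", 2)])

# json.dumps(page) raises TypeError, because FakeItem is not a dict.
# Converting to plain types first gives json something it understands.
serialized = json.dumps(to_plain(page), sort_keys=True)
```

The conversion goes through the mapping interface rather than the object's attributes, so internal storage such as `_values` does not appear in the output. Nested items inside lists are converted too, which a plain `dict(item)` on the top-level item would not do.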