
I'm using Scrapy to crawl a website, and I'm generating a document that's pretty large: it has 3 properties, one of which is an array with over 5 thousand objects, each of which in turn has some properties and small arrays inside it. In total, it would be over 2 MB if written to a file, which is not really that big.

After I crawl an object, I use the scrapy-mongodb pipeline to upsert it into the database. Every time, I get an error like the ones in this gist: https://gist.github.com/ranisalt/ac572185e11e5918082b

(there are 6 errors in total, one for each object, but the crawler output was too large and was cut)

The objects that fail to encode are in the large array I mentioned in the first paragraph.

What can possibly make an object fail to be encoded by pymongo, and what fix might apply to my documents?
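
To narrow things down, I imagine probing each field with pymongo's bson module, along the lines of the sketch below (the find_unencodable helper and its field handling are made up for illustration, not code I'm actually running):

from bson import BSON
from bson.errors import InvalidDocument, InvalidStringData

def find_unencodable(value, path='doc'):
    # recursively try to BSON-encode each value to locate the offending field
    if isinstance(value, dict):
        for key, child in value.items():
            find_unencodable(child, '%s.%s' % (path, key))
    elif isinstance(value, list):
        for i, child in enumerate(value):
            find_unencodable(child, '%s[%d]' % (path, i))
    else:
        try:
            BSON.encode({'probe': value})
        except (InvalidDocument, InvalidStringData) as err:
            print('%s: %r (%s)' % (path, value, err))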

If you need anything else, please ask in the comments.

ranieri
  • I tried to insert one of the documents and it worked without any errors for me. Which version of MongoDB are you using, and how are you inserting the documents into the DB? – Rafael Barros Jan 06 '15 at 20:57
  • I'm using version 2.4.6. The examples are not a document I'm trying to insert but rather objects nested inside the document. I'm going to upload an entire document. – ranieri Jan 06 '15 at 21:00
  • Here it is: https://gist.github.com/ranisalt/d7320d6993664e87b7c0 (this is an entire document to be inserted)
  • It inserts as it should on mongo 2.6 – Rafael Barros Jan 06 '15 at 21:38

1 Answer


I believe the problem you encountered is due to escaped characters not being fully converted to UTF-8 before being inserted into MongoDB from Python.

I haven't checked the MongoDB changelog, but if I remember correctly, full Unicode has been supported since v2.2+.
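
To illustrate what I mean (a sketch from memory, assuming Python 2 and pymongo 2.x): pymongo happily encodes unicode objects, but accepts a plain str only if its bytes are already valid UTF-8:

# -*- coding: utf-8 -*-
from bson import BSON

BSON.encode({'title': u'História'})        # unicode object: encoded to UTF-8, works
BSON.encode({'title': 'Hist\xc3\xb3ria'})  # str holding valid UTF-8 bytes: works
BSON.encode({'title': 'Hist\xf3ria'})      # latin-1 bytes: raises InvalidStringData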

Anyway, you have 2 approaches: upgrade to a newer version of MongoDB (2.6), or modify/override your scrapy-mongodb script. To change scrapy_mongodb.py, look at these lines; k isn't converted to UTF-8 before being inserted into MongoDB:

# ... previous code ...
        # build the query document for the upsert from the configured unique key(s)
        key = {}
        if isinstance(self.config['unique_key'], list):
            for k in dict(self.config['unique_key']).keys():
                key[k] = item[k]
        else:
            key[self.config['unique_key']] = item[self.config['unique_key']]

        self.collection.update(key, item, upsert=True)
# ... and the rest ...
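
If you only want to patch this one spot, a minimal inline sketch (my variant, not the library's code, assuming Python 2) could look like this:

        # force unicode keys and values to UTF-8 str before building the query
        key = {}
        if isinstance(self.config['unique_key'], list):
            for k in dict(self.config['unique_key']).keys():
                k8 = k.encode('utf-8') if isinstance(k, unicode) else k
                v = item[k]
                key[k8] = v.encode('utf-8') if isinstance(v, unicode) else v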

To fix this more generally, you can add these few lines within the process_item function:

# ... previous code ...
def process_item(self, item, spider):
    """ Process the item and add it to MongoDB
    :type item: Item object
    :param item: The item to put into MongoDB
    :type spider: BaseSpider object
    :param spider: The spider running the queries
    :returns: Item object
    """
    item = dict(self._get_serialized_fields(item))
    # add a recursive helper to convert all unicode to UTF-8 encoded str
    # snippet taken from this SO answer:
    # http://stackoverflow.com/questions/956867/how-to-get-string-objects-instead-of-unicode-ones-from-json-in-python
    def byteify(input):
        if isinstance(input, dict):
            return {byteify(key): byteify(value) for key, value in input.iteritems()}
        elif isinstance(input, list):
            return [byteify(element) for element in input]
        elif isinstance(input, unicode):
            return input.encode('utf-8')
            # if the utf-8 conversion above still doesn't work, strip the unicode entirely:
            # return input.encode('ASCII', 'ignore')
        else:
            return input
    # finally, replace the item with the converted copy
    item = byteify(item)
    # ... rest of the code ... #
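
For instance (my own quick check, not part of scrapy-mongodb), the helper turns every unicode value in a nested item into a UTF-8 encoded str:

# -*- coding: utf-8 -*-
# assumes the byteify helper defined above is in scope
item = {u'name': u'História', u'tags': [u'memória', u'ação']}
print(byteify(item))
# {'name': 'Hist\xc3\xb3ria', 'tags': ['mem\xc3\xb3ria', 'a\xc3\xa7\xc3\xa3o']}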

If this still doesn't work, I'd advise upgrading your MongoDB to a newer version.

Hope this helps.

Anzel
  • I don't believe it's a Mongo problem. I have adapted your byteify function; it helps to "de-unicode" the strings that were previously unicode, but they get double escaped. Where `História` was escaped to `Hist\xf3ria` as unicode, now it is `Hist\xc3\xb3ria`, and I still cannot insert. – ranieri Jan 07 '15 at 00:45
  • @ranisalt, have you tried `.encode('ASCII', 'ignore')` to actually remove the unicode? – Anzel Jan 07 '15 at 00:56
  • Yup, tried it now, and again it did not work. I will try to update Mongo. – ranieri Jan 07 '15 at 01:08
  • I think the problem is probably that the strings are being extracted as unicode. If I manually insert them with diacritics through the Mongo shell, it works fine. Any idea how to stop strings from being converted to unicode? – ranieri Jan 07 '15 at 01:29
  • @ranisalt, then you need to do **str(u'unicode here')**. But again if even ASCII won't work, the problem lies somewhere else – Anzel Jan 07 '15 at 01:33
  • Welp, I removed every conversion from string to unicode from scrapy, and it now generates completely valid BSON. I can copy and paste it into the mongo shell and it inserts fine, but it is still not possible to create the BSON document with Python. – ranieri Jan 07 '15 at 02:44