I believe the problem you encountered is caused by escaped characters not being fully converted to UTF-8 before they are inserted into MongoDB from Python.
I haven't checked the MongoDB changelog, but if I remember correctly, full Unicode should be supported since v2.2.
Anyway, you have two options: upgrade to a newer version of MongoDB (2.6), or modify/override your scrapy-mongodb script.
To change scrapy_mongodb.py, look at these lines; k isn't converted to UTF-8 before being inserted into MongoDB:
    # ... previous code ...
    key = {}
    if isinstance(self.config['unique_key'], list):
        for k in dict(self.config['unique_key']).keys():
            key[k] = item[k]
    else:
        key[self.config['unique_key']] = item[self.config['unique_key']]
    self.collection.update(key, item, upsert=True)
    # ... and the rest ...
To fix this, you can add these few lines inside the process_item function:
    # ... previous code ...
    def process_item(self, item, spider):
        """ Process the item and add it to MongoDB

        :type item: Item object
        :param item: The item to put into MongoDB
        :type spider: BaseSpider object
        :param spider: The spider running the queries
        :returns: Item object
        """
        item = dict(self._get_serialized_fields(item))

        # add a recursive function to convert all unicode strings to utf-8
        # snippet taken from this SO answer:
        # http://stackoverflow.com/questions/956867/how-to-get-string-objects-instead-of-unicode-ones-from-json-in-python
        def byteify(input):
            if isinstance(input, dict):
                return {byteify(key): byteify(value) for key, value in input.iteritems()}
            elif isinstance(input, list):
                return [byteify(element) for element in input]
            elif isinstance(input, unicode):
                return input.encode('utf-8')
                # if the utf-8 conversion above still does not work,
                # drop the non-ASCII characters completely instead:
                # return input.encode('ASCII', 'ignore')
            else:
                return input

        # finally, replace the item with its converted version
        item = byteify(item)
        # ... rest of the code ... #
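If you want to check what the conversion does before patching the pipeline, here is a minimal standalone sketch of the same helper. It is only an illustration: the `text_type` alias and the `sample` item are made up for this example and are not part of scrapy-mongodb, and I renamed the argument to `value` to avoid shadowing the built-in `input`. The alias is only there so the snippet also runs under Python 3.

```python
# Standalone sketch of the byteify helper; not part of scrapy-mongodb.
try:
    text_type = unicode  # Python 2: Unicode text type
except NameError:
    text_type = str      # Python 3: str is already Unicode text


def byteify(value):
    """Recursively encode every text string in dicts and lists as UTF-8 bytes."""
    if isinstance(value, dict):
        return {byteify(k): byteify(v) for k, v in value.items()}
    elif isinstance(value, list):
        return [byteify(element) for element in value]
    elif isinstance(value, text_type):
        return value.encode('utf-8')
    else:
        # numbers, None, already-encoded strings, etc. pass through unchanged
        return value


# Hypothetical item, invented for illustration only
sample = {u'title': u'caf\xe9', u'tags': [u'na\xefve', 42]}
converted = byteify(sample)
```

Non-string values such as the integer in `tags` are left untouched, so it should be safe to run this over a whole item dict.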
If this still doesn't work, I'd advise upgrading MongoDB to a newer version.
Hope this helps.