
I use Scrapy to crawl data and save it to MongoDB, and I want to create a 2dsphere index in MongoDB.

Here is my pipelines.py file for Scrapy:

from pymongo import MongoClient
from scrapy.conf import settings

class MongoDBPipeline(object):

    def __init__(self):
        connection = MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'])
        self.db = connection[settings['MONGODB_DB']]
        self.collection = self.db[settings['MONGODB_COLLECTION']]
        # theaters whose old documents have already been cleared this run
        self.theaters = []

    def open_spider(self, spider):
        print('Pipelines => open_spider =>')

    def process_item(self, item, spider):
        # use the item class name (minus the _Item suffix) as the collection name
        self.collection = self.db[type(item).__name__.replace('_Item', '')]

        if item['theater'] not in self.theaters:
            print('remove =>', item['theater'])
            self.theaters.append(item['theater'])
            self.collection.delete_many({'theater': item['theater']})

        # insert the item into the collection named after its class
        self.collection.insert_one(dict(item))
        # here is where I try to create the 2dsphere index
        self.collection.create_index({"location": "2dsphere"})

        return item

When I use self.collection.create_index({"location": "2dsphere"})

it shows the error: TypeError: if no direction is specified, key_or_list must be an instance of list

If I try

self.collection.create_index([('location', "2dsphere")], name='search_index', default_language='english')

there is no error anymore, but my MongoDB still doesn't have any index under location.

I think I am following the GeoJSON format.
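For reference, a 2dsphere index expects the indexed field to hold GeoJSON, with coordinates in [longitude, latitude] order. A minimal sketch of such a document (the values are made up; the field names mirror the question):

```python
# A document shaped for a 2dsphere index on "location".
# GeoJSON puts longitude first, then latitude.
theater_doc = {
    "theater": "Example Cinema",             # hypothetical value
    "location": {
        "type": "Point",
        "coordinates": [121.5654, 25.0330],  # [lng, lat]
    },
}
```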

Is there any way to save a 2dsphere index in MongoDB while using Scrapy? Or should I just save the data in the structure shown and create the index from another server-side script (in Node.js, for example)?

Any help would be appreciated. Thanks in advance.

According to Adam Harrison's response, I tried changing my MongoDB field name from location to geometry,

then added import pymongo to my pipelines.py file

and used self.collection.create_index([("geometry", pymongo.GEOSPHERE)]).

There is no error, but I still can't find the index under geometry.
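For what it's worth, `create_index` expects a list of `(field, direction)` tuples, and `pymongo.GEOSPHERE` is just the string `"2dsphere"`; one way to confirm the index was actually built is to look it up in `index_information()`. A sketch, assuming `coll` is a pymongo `Collection` connected to a running `mongod`:

```python
def create_and_check(coll):
    # list-of-tuples spec; "2dsphere" is the literal value of pymongo.GEOSPHERE
    name = coll.create_index([("geometry", "2dsphere")])
    # index_information() lists every index on the collection, so membership
    # here confirms the index really exists on the server
    return name, name in coll.index_information()
```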

Morton
  • Possible duplicate of [Does anyone know a working example of 2dsphere index in pymongo?](https://stackoverflow.com/questions/16908675/does-anyone-know-a-working-example-of-2dsphere-index-in-pymongo) – Sohaib Farooqi Mar 09 '18 at 01:11
  • That question is about how to use a `2dsphere` index with pymongo, not how to save a `2dsphere` index with pymongo. – Morton Mar 09 '18 at 02:01
  • Check the answer; it addresses how to specify a `2dsphere` index. – Sohaib Farooqi Mar 09 '18 at 02:06
  • I have used the code from that question, `self.collection.create_index([("location", "2dsphere")])`, and from its answer, `self.collection.ensure_index([("location", "2dsphere")])`. Neither shows any error, but still no index is created in my MongoDB. Can't figure it out :( – Morton Mar 09 '18 at 02:19
  • Try `collection.create_index([("geometry", pymongo.GEOSPHERE)])`. Taken from the ticket bro-grammer linked. See http://api.mongodb.com/python/current/api/pymongo/collection.html for more info – Adam Harrison Mar 09 '18 at 05:14
  • Thanks for your reply. I added `import pymongo`, changed the key name `location` to `geometry`, then used `self.collection.create_index([("geometry", pymongo.GEOSPHERE)])`. There is still no error message and no index in my MongoDB. – Morton Mar 09 '18 at 06:17
  • @AdamHarrison I looked into the documentation; it sounds like a solution, but it's not working for me. Don't know why. – Morton Mar 09 '18 at 07:20
  • @AdamHarrison Thanks for your reply, I tried the code and it's working. – Morton Mar 15 '18 at 05:53
  • I was wondering if it is possible to use an "input_processor"/"output_processor" in the "item.py" file to do this job. – Ericksan Pimentel Dec 24 '21 at 16:28

1 Answer


For me, it was necessary to use ItemAdapter to convert the item parameter into a dict-like object so I could query the database.

import pymongo
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class MongoPipeline:

    collection_name = 'myCollection'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

The process_item function:

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if self.db[self.collection_name].find_one({'id': adapter['id']}) is not None:
            dado = self.db[self.collection_name].find_one({'id': adapter['id']})
            ## ----> raise DropItem(f"Duplicate item found: {item!r}") <------
            print(f"Duplicate item found: {dado!r}")
        else:
            self.db[self.collection_name].insert_one(adapter.asdict())
        return item
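The pipeline above handles deduplication but not the 2dsphere index the question asked about. Assuming the `geometry` field layout discussed in the comments, one hedged way to bolt that on is a small helper called once from `open_spider` (this is a sketch, not the answer author's code; `create_index` is idempotent, so repeated calls are harmless):

```python
def ensure_2dsphere(db, collection_name):
    # "2dsphere" is the literal value of pymongo.GEOSPHERE; on a real server
    # create_index returns the index name, e.g. "geometry_2dsphere"
    return db[collection_name].create_index([("geometry", "2dsphere")])
```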