
I have implemented bulk indexing. I'd like to make it more efficient.

# Current implementation in Python

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


def products_to_index():
    # One bulk action per product.
    for product in all_products():
        yield {
            "_op_type": "index",
            "_index": INDEX_NAME,
            "_id": product.id,
            "_source": {"name": product.name, "content": product.content},
        }


def main(args):
    # Connect to localhost:9200 by default.
    es = Elasticsearch()

    # ANALYZER holds the index settings/mappings (defined elsewhere).
    es.indices.create(index=INDEX_NAME, body=ANALYZER)

    bulk(es, products_to_index())

This implementation seems to just take all the data and index it batch by batch. I'd like to add a step that checks whether an entry has already been indexed before indexing it again.
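Something like the following is what I have in mind, where the generator skips ids that already exist. This is just a sketch: the client would have to be passed into the generator, and `exists` adds one round trip per product, which is part of why I'm asking for something more efficient.

def products_to_index(es):
    for product in all_products():
        # Skip products whose id is already in the index.
        if es.exists(index=INDEX_NAME, id=product.id):
            continue
        yield {
            "_op_type": "index",
            "_index": INDEX_NAME,
            "_id": product.id,
            "_source": {"name": product.name, "content": product.content},
        }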

I also thought about loading from the path of locally saved indices, but I'm not sure how to proceed.

I looked through the API documentation but couldn't find anything relevant.


1 Answer


By using `index` you tell Elasticsearch "index this document, and if it already exists, update it." If you instead use the `create` op type with a specific id, Elasticsearch behaves in a "put-if-absent" manner: the document is only written if no document with that id exists yet. When you use the bulk API, the response reports the result of each document separately, so you can tell which documents were inserted and which were not. For this purpose, all you need to do is set your `_op_type` to `create`.
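For example, here is a sketch of your generator with `create`, plus a loop over the per-document results using the `streaming_bulk` helper with `raise_on_error=False`, so conflicts are reported instead of raised (the `all_products` and `INDEX_NAME` names are taken from your snippet):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk


def products_to_create():
    for product in all_products():
        yield {
            "_op_type": "create",  # fails with 409 if the id already exists
            "_index": INDEX_NAME,
            "_id": product.id,
            "_source": {"name": product.name, "content": product.content},
        }


def main(args):
    es = Elasticsearch()

    created = skipped = 0
    for ok, result in streaming_bulk(es, products_to_create(), raise_on_error=False):
        # result looks like {"create": {"_id": ..., "status": 201, ...}}
        _, info = result.popitem()
        if ok:
            created += 1
        elif info.get("status") == 409:
            skipped += 1  # a document with this id was already indexed
    print(f"created {created}, skipped {skipped} existing documents")

Documents that already exist come back with HTTP status 409 (a version conflict), which is how you distinguish "already indexed" from a real indexing failure.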
