
I'm doing an investigation for a seminar on information retrieval. I have a JSON file with a list of articles, and I need to index them and then use a percolator with highlighting.

The list of steps to do this in the terminal is:
1. Create a mapping with a percolator field:

curl -XPUT 'localhost:9200/my-index?pretty' -H 'Content-Type: application/json' -d'
{
    "mappings": {
        "_doc": {
            "properties": {
                "title": {
                    "type": "text"
                },
                "query": {
                    "type": "percolator"
                }
            }
        }
    }
}
'
2. Index a new article together with its query:

    curl -XPUT 'localhost:9200/my-index/_doc/1?refresh&pretty' -H 'Content-Type: application/json' -d'
    {           
        "CourseId":35,
          "UnitId":12390,
          "id":"16069",
          "CourseName":"ARK102U_ARKEOLOJİK ALAN YÖNETİMİ",
          "FieldId":8,
          "field":"TARİH",
        "query": {
            "span_near" : {
                "clauses" : [
                    { "span_term" : { "title" : "dünya" } },
                    { "span_term" : { "title" : "mirası" } },
                    { "span_term" : { "title" : "sözleşmesi" } }
                ],
                "slop" : 0,
                "in_order" : true
            }
        }
    }
    '
    
3. Percolate a document:

    curl -XGET 'localhost:9200/my-index/_search?pretty' -H 'Content-Type: application/json' -d'
    {
        "query" : {
            "percolate" : {
                "field" : "query",
                "document" : {
                    "title" : "Arkeoloji, arkeolojik yöntemlerle ortaya çıkarılmış kültürleri, dünya mirası sözleşmesi sosyoloji, coğrafya, tarih, etnoloji gibi birçok bilim dalından yararlanarak araştıran ve inceleyen bilim dalıdır. Türkçeye yanlış bir şekilde> \"kazıbilim\" olarak çevrilmiş olsa da kazı, arkeolojik araştırma yöntemlerinden sadece bir tanesidir."
                }
            }
        },
    
        "highlight": {
          "fields": {
            "title": {}
          }
        }
    }
    '
    

This is the code I have so far:

import json
from elasticsearch_dsl import (
    DocType,
    Integer,
    Percolator,
    Text,
)

# Read the json File
json_data = open('titles.json').read()
data = json.loads(json_data)

docs = data['response']['docs']

# Create an Elasticsearch connection
# connections.create_connection(hosts=['localhost'], port=['9200'], timeout=20)
"""
curl -XPUT 'localhost:9200/my-index?pretty' -H 'Content-Type: application/json' -d'
{
    "mappings": {
        "_doc": {
            "properties": {
                "title": {
                    "type": "text"
                },
                "query": {
                    "type": "percolator"
                }
            }
        }
    }
}
'

"""

class Document(DocType):
    course_id = Integer()
    unit_id = Integer()
    # title = Text()
    id = Integer()
    course_name = Text()
    field_id = Integer()
    field = Text()


    class Meta:
        index = 'titles_index'


    # intended mapping fields (not valid Python as written, kept here as a note):
    # properties={
    #     'title': Text(),
    #     'query': Percolator()
    # }

"""
    "query": {
        "span_near" : {
            "clauses" : [
                { "span_term" : { "title" : "dünya" } },
                { "span_term" : { "title" : "mirası" } },
                { "span_term" : { "title" : "sözleşmesi" } }
            ],
            "slop" : 0,
            "in_order" : true
        }
    }

"""

for doc in docs:

    terms = doc['title'].split(" ")
    course_id = doc['CourseId']
    unit_id = doc['UnitId']
    id = doc['id']
    course_name = doc['CourseName']
    field_id = doc['FieldId']
    field = doc['field']

UPDATE: Thank you for the answer; I have this now:

import json

from elasticsearch_dsl import (
    connections,
    DocType,
    Mapping,
    Percolator,
    Text
)
from elasticsearch_dsl.query import (
    SpanNear,
    SpanTerm
)
from elasticsearch import Elasticsearch

# Read the json File
json_data = open('titles.json').read()
data = json.loads(json_data)

docs = data['response']['docs']


# creating a new default elasticsearch connection
connections.configure(
    default={'hosts': 'localhost:9200'},
)


class Document(DocType):
    title = Text()
    query = Percolator()

    class Meta:
        index = 'title-index'
        doc_type = '_doc'

    def save(self, **kwargs):
        return super(Document, self).save(**kwargs)


# create the mappings in elasticsearch
Document.init()

# index the query
for doc in docs:
    terms = doc['title'].split(" ")
    clauses = []
    for term in terms:
        field = SpanTerm(title=term)
        clauses.append(field)
    query = SpanNear(clauses=clauses)
    item = Document(title=doc['title'],query=query)
    item.save()

It is working fine, but I have two problems now:

  1. I'm getting the following error after indexing a random number of items from the dict:

    elasticsearch.exceptions.AuthorizationException: TransportError(403, 'cluster_block_exception', 'blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')

I know I can solve this problem using this command:

    curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
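The same setting can also be cleared from Python; a minimal sketch, assuming the standard `elasticsearch` client already used above:

    from elasticsearch import Elasticsearch

    client = Elasticsearch(['localhost:9200'])

    # reset index.blocks.read_only_allow_delete to null on all indices
    client.indices.put_settings(
        index='_all',
        body={'index.blocks.read_only_allow_delete': None}
    )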

UPDATE: I finally solved it by deleting the data folder.

But now when I search the index I don't get any results:

>>> text='Arkeoloji, arkeolojik yöntemlerle ortaya çıkarılmış kültürleri, dünya mirası sözleşmesi sosyoloji, coğrafya, tarih, etnoloji gibi birçok bilim dalından yararlanarak araştıran ve inceleyen bilim dalıdır. Türkçeye yanlış bir şekilde> \"kazıbilim\" olarak çevrilmiş olsa da kazı, arkeolojik araştırma yöntemlerinden sadece bir tanesidir.'
>>> s = Search().using(client).query("percolate", field='query', document={'title': text}).highlight('title')
>>> print(s.to_dict())
{'query': {'percolate': {'field': 'query', 'document': {'title': 'Arkeoloji, arkeolojik yöntemlerle ortaya çıkarılmış kültürleri, dünya mirası sözleşmesi sosyoloji, coğrafya, tarih, etnoloji gibi birçok bilim dalından yararlanarak araştıran ve inceleyen bilim dalıdır. Türkçeye yanlış bir şekilde> "kazıbilim" olarak çevrilmiş olsa da kazı, arkeolojik araştırma yöntemlerinden sadece bir tanesidir.'}}}, 'highlight': {'fields': {'title': {}}}}
>>> response = s.execute()
>>> response
<Response: {}>

And this is my attempt with curl:

 curl -XGET 'localhost:9200/title-index/_search?pretty' -H 'Content-Type: application/json' -d '{  
    "query" : {        
        "percolate" : {       
            "field" : "query",
            "document" : {
                "title" : "Arkeoloji, arkeolojik yöntemlerle ortaya çıkarılmış kültürleri, dünya mirası sözleşmesi sosyoloji, coğrafya, tarih, etnoloji gibi birçok bilim dalından yararlanarak araştıran ve inceleyen bilim dalıdır. Türkçeye yanlış bir şekilde> \"kazıbilim\" olarak çevrilmiş olsa da kazı, arkeolojik araştırma yöntemlerinden sadece bir tanesidir."
            }
        }
    },            
    "highlight": {
      "fields": {  
        "title": {}
      }
    }
}'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

I'm getting varying stats but no results:

>>> response.to_dict()
{'took': 9, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 0, 'max_score': None, 'hits': []}}
>>> response
{'took': 12, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 0, 'max_score': None, 'hits': []}}

Can anyone help me?

1 Answer


The first step is correct, i.e. the mapping is correct. But then you first need to index a query; that's the whole point of percolation. So let's index your query:

curl -XPUT 'localhost:9200/my-index/_doc/my-span-query?refresh&pretty' -H 'Content-Type: application/json' -d '{           
    "query": {
        "span_near" : {
            "clauses" : [
                { "span_term" : { "title" : "dünya" } },
                { "span_term" : { "title" : "mirası" } },
                { "span_term" : { "title" : "sözleşmesi" } }
            ],
            "slop" : 0,
            "in_order" : true
        }
    }
}'

Then the idea is to find out which query would match the document you're percolating, so let's percolate a document:

curl -XGET 'localhost:9200/my-index/_search?pretty' -H 'Content-Type: application/json' -d '{
    "query" : {
        "percolate" : {
            "field" : "query",
            "document" : {
                "title" : "Arkeoloji, arkeolojik yöntemlerle ortaya çıkarılmış kültürleri, dünya mirası sözleşmesi sosyoloji, coğrafya, tarih, etnoloji gibi birçok bilim dalından yararlanarak araştıran ve inceleyen bilim dalıdır. Türkçeye yanlış bir şekilde> \"kazıbilim\" olarak çevrilmiş olsa da kazı, arkeolojik araştırma yöntemlerinden sadece bir tanesidir."
            }
        }
    },
    "highlight": {
      "fields": {
        "title": {}
      }
    }
}'

And you would get this response with highlighting where you can see that my-span-query matches the given document:

{
  "took": 104,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.8630463,
    "hits": [
      {
        "_index": "my-index",
        "_type": "_doc",
        "_id": "my-span-query",
        "_score": 0.8630463,
        "_source": {
          "query": {
            "span_near": {
              "clauses": [
                {
                  "span_term": {
                    "title": "dünya"
                  }
                },
                {
                  "span_term": {
                    "title": "mirası"
                  }
                },
                {
                  "span_term": {
                    "title": "sözleşmesi"
                  }
                }
              ],
              "slop": 0,
              "in_order": true
            }
          }
        },
        "fields": {
          "_percolator_document_slot": [
            0
          ]
        },
        "highlight": {
          "title": [
            "Arkeoloji, arkeolojik yöntemlerle ortaya çıkarılmış kültürleri, <em>dünya</em> <em>mirası</em> <em>sözleşmesi</em> sosyoloji, coğrafya"
          ]
        }
      }
    ]
  }
}

UPDATE

The same thing using elasticsearch-dsl-py:

from elasticsearch_dsl import DocType, Text, Percolator, Mapping
from elasticsearch import Elasticsearch

class Document(DocType):
    title = Text()
    query = Percolator()

    class Meta:
        index = 'my-index'

    def save(self, **kwargs):
        return super(Document, self).save(**kwargs)

# 1a. create the mappings in elasticsearch
Document.init()

# 1b. or another alternative way of saving the mapping
query_mapping = Mapping('_doc')
query_mapping.field('title', 'text')
query_mapping.field('query', 'percolator')
query_mapping.save('my-index')

# 2. index the query
query = Document(query={...your span query here...})
query.save()

# 3. send the percolate query
client = Elasticsearch()
response = client.search(
    index="my-index",
    body={
      "query" : {
        "percolate" : {
            "field" : "query",
            "document" : {
                "title" : "Arkeoloji, arkeolojik yöntemlerle ortaya çıkarılmış kültürleri, dünya mirası sözleşmesi sosyoloji, coğrafya, tarih, etnoloji gibi birçok bilim dalından yararlanarak araştıran ve inceleyen bilim dalıdır. Türkçeye yanlış bir şekilde> \"kazıbilim\" olarak çevrilmiş olsa da kazı, arkeolojik araştırma yöntemlerinden sadece bir tanesidir."
            }
        }
    },
    "highlight": {
      "fields": {
        "title": {}
      }
    }
  }
)
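To see which stored queries matched and what was highlighted, you can then iterate over the hits in the response dict; a short sketch based on the response shape shown above:

# each hit is a stored (percolator) query that matched the percolated document
for hit in response['hits']['hits']:
    print(hit['_id'], hit['_score'])
    # highlighted fragments for the 'title' field, if any
    for fragment in hit.get('highlight', {}).get('title', []):
        print(fragment)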

UPDATE 2

There is no reason to also store the title along with the query; you only need to store the query, so your code should look like this instead:

# index the query
for doc in docs:
    terms = doc['title'].split(" ")
    clauses = []
    for term in terms:
        field = SpanTerm(title=term)
        clauses.append(field)
    query = SpanNear(clauses=clauses)
    item = Document(query=query)        # <-- change this line
    item.save()
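Since you are also searching with elasticsearch_dsl, the same percolate query with highlighting can be expressed through its Search API; just make sure to specify the index explicitly. A sketch, assuming the my-index name used above (adjust it to whatever index you actually created):

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch()
text = "..."  # the document text to percolate, as in the curl examples

s = Search(using=client, index='my-index') \
    .query('percolate', field='query', document={'title': text}) \
    .highlight('title')

response = s.execute()
for hit in response:
    print(hit.meta.id, hit.meta.score)
    # highlighted fragments for the 'title' field, if any
    highlight = getattr(hit.meta, 'highlight', None)
    if highlight is not None:
        print(list(highlight.title))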
  • I'm sorry, but I know how to do this; I need to do the same thing but using `elasticsearch-dsl-py`. – SalahAdDin Mar 28 '18 at 06:04
  • Well, the second step (i.e. indexing the query) was not correct and could not have worked that way. Anyway, let me update my answer and add some python in there – Val Mar 28 '18 at 06:14
  • I've added the comments before each step, it's pretty straightforward and does the same as each curl command above, 1) create the mapping for the percolator field, 2) index the query, 3) percolate a sample document. – Val Mar 28 '18 at 07:06
  • How can I index all documents in my dictionary? – SalahAdDin Mar 28 '18 at 07:15
  • I got it; with the second way I get a proper mapping for `_doc`, but with the first one I'm getting a mapping for `doc`. Why? – SalahAdDin Mar 28 '18 at 09:00
  • `doc` is the default type name used by the library, see here: https://github.com/elastic/elasticsearch-dsl-py/blob/762f0cd5dfed0bd0f1e338bc508f9ac0c226cffe/docs/persistence.rst#class-meta-options You can change that in the `Meta` class. It doesn't matter much as long as you're using the same type everywhere. It could be `foobar` for what it's worth. – Val Mar 28 '18 at 09:32
  • That's right, thanks. Now I'm building the query using the library's types: `for doc in docs: terms = doc['title'].split(" ") query = SpanNear() for term in terms: field = SpanTerm(field=term)` – SalahAdDin Mar 28 '18 at 09:47
  • Do you know how I can use this instead of a dict? – SalahAdDin Mar 28 '18 at 09:47
  • Adding code to comments is not really legible, you should probably update your question instead. – Val Mar 28 '18 at 09:49
  • OK, I updated my question; please add the new additions to your answer, and if we can solve the final problem, I will accept your answer. Thank you very much!!! – SalahAdDin Mar 28 '18 at 10:16
  • I've updated my answer with a slight modification to your code. Regarding the error you're getting, I think this is because your hard disk is almost full, can that be? – Val Mar 28 '18 at 10:47
  • There you go, increase the disk space and/or make some room and you should be fine going forth. – Val Mar 28 '18 at 10:56
  • Now it does not work; when I init the `Document` I get this problem: a connection problem. – SalahAdDin Mar 28 '18 at 11:54
  • connection problem means that either your ES is down or you have networking issues – Val Mar 28 '18 at 11:57
  • I didn't have this problem before; now I cannot put any element into the index. – SalahAdDin Mar 28 '18 at 12:08
  • what do you get if you run this `curl -XGET localhost:9200`? – Val Mar 28 '18 at 12:23
  • Yeah, your disk was full and ES wouldn't start – Val Mar 28 '18 at 12:28
  • Please update your code; it's necessary to add the title field to the index: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html#_percolate_query_and_highlighting – SalahAdDin Mar 28 '18 at 17:46
  • The title field is already in the mapping, it is sufficient, but there is no need to add the title field to the query document you're indexing. – Val Mar 28 '18 at 18:05
  • So why can't I get any result from the search? – SalahAdDin Mar 28 '18 at 18:12
  • How about using `Search(using=client, index="my-index")` instead? You're not specifying any index in your percolate query. – Val Mar 28 '18 at 18:37
  • I did it with curl, it was the same result. – SalahAdDin Mar 28 '18 at 18:47
  • Do you mean my curl commands? I've executed them all through Kibana and that's how I got the result I showed you in the answer. I'll try the curls a bit later, maybe there's something I overlooked. – Val Mar 29 '18 at 04:15
  • Did you repeat all curl commands from the beginning or only the one for searching/percolating? It is important to test this thoroughly and not a few steps in python and then a few steps in curl. – Val Mar 29 '18 at 11:30
  • I'm testing the percolation query, because the other queries are supposed to have been done with the previous code. – SalahAdDin Mar 29 '18 at 15:03
  • I tried with the dictionary, with curl, with everything; I get the complete index, but it doesn't use the whole index, it only takes 12 records. – SalahAdDin Mar 29 '18 at 15:08
  • I updated the question again. I can't understand why the search doesn't take all documents in the index. – SalahAdDin Mar 30 '18 at 17:00
  • What's the problem here? – SalahAdDin Apr 04 '18 at 09:52
  • I have to dive into this again, for me all the curl commands worked, I haven't tested the python code though. – Val Apr 04 '18 at 11:11
  • I tried adding the size value but it still doesn't work: https://stackoverflow.com/questions/8829468/elasticsearch-query-to-return-all-records – SalahAdDin Apr 08 '18 at 14:57
  • I created a new question here: https://stackoverflow.com/questions/49719893/searching-along-more-than-180000-documents-index-does-not-return-any-results – SalahAdDin Apr 08 '18 at 16:04
  • I can't get any result. With a single-document index I get the answer, but with the full set I cannot get it. – SalahAdDin Apr 17 '18 at 20:02
  • ```{'took': 9, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 0, 'max_score': None, 'hits': []}} >>> ``` – SalahAdDin Apr 17 '18 at 20:07