
I am trying to do semantic search with Elasticsearch using tensorflow_hub, but I get RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error'). From search_phase_execution_exception I suppose the problem is corrupted data (based on this stack question). My document structure looks like this:

{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": "true",
    "_source": {
      "enabled": "true"
    },
    "properties": {
      "id": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "abstract": {
        "type": "text"
      },
      "abs_emb": {
        "type": "dense_vector",
        "dims": 512
      },
      "timestamp": {
        "type": "date"
      }
    }
  }
}

And I create the index using es.indices.create, then index the documents:

es.indices.create(index=index, body=my_document_structure)  # the mapping shown above
res = es.indices.delete(index=index, ignore=[404])
for i in range(100):
  doc = {
    'timestamp': datetime.datetime.utcnow(),
    'id': id[i],
    'title': title[0][i],
    'abstract': abstract[0][i],
    'abs_emb': tf_hub_KerasLayer([abstract[0][i]])[0]
  }
  res = es.index(index=index, body=doc)

For my semantic search I use this code:

query = "graphene"
query_vector = list(embed([query])[0])

script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, doc['abs_emb']) + 1.0",
            "params": {"query_vector": query_vector}
        }
    }
}

response = es.search(
    index=index,
    body={
        "size": 5,
        "query": script_query,
        "_source": {"includes": ["title", "abstract"]}
    }
)

I know there are some similar questions on Stack Overflow and the Elasticsearch forum, but I couldn't find a solution that works for me. My guess is that the document structure is wrong, but I can't figure out what exactly. I used the search query code from this repo. The full error message is too long and doesn't seem to contain much information, so I share only the last part of it:

~/untitled/elastic/venv/lib/python3.9/site-packages/elasticsearch/connection/base.py in _raise_error(self, status_code, raw_data)
320             logger.warning("Undecodable raw error response from server: %s", err)
321 
--> 322         raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
323             status_code, error_message, additional_info
324         )

RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error')

And here is the error from the Elasticsearch server:

[2021-04-29T12:43:07,797][WARN ][o.e.c.r.a.DiskThresholdMonitor] [asmac.local]
high disk watermark [90%] exceeded on
[w7lUacguTZWH9xc_lyd0kg][asmac.local][/Users/username/elasticsearch-7.12.0/data/nodes/0]
free: 17.2gb[7.4%], shards will be relocated away from this node; currently
relocating away shards totalling [0] bytes; the node is expected to continue
to exceed the high disk watermark when these relocations are complete
Armen Sanoyan

3 Answers


I think you're hitting the following issue and you should update your query to this:

script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, 'abs_emb') + 1.0",
            "params": {"query_vector": query_vector}
        }
    }
}

Also make sure that query_vector contains floats and not doubles.
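
If the embedding comes from TensorFlow Hub, it typically arrives as a float32 tensor. A minimal sketch of forcing plain Python floats before sending the query (assuming embed is the same TF Hub layer used in the question):

import numpy as np

query = "graphene"
# Convert the tensor to a NumPy array, then each element to a plain
# Python float so the JSON body contains numbers, not numpy scalars.
query_vector = [float(x) for x in np.asarray(embed([query])[0])]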

Val
  • I have checked the type and it was numpy.float32, so that didn't seem to be the case. I also updated the code from doc['abs_emb'] to 'abs_emb', but I still get the same error. – Armen Sanoyan Apr 29 '21 at 09:08
  • Ok, I'm pretty sure you should find the error in the ES logs somewhere... Do you have multiple nodes? If not, can you maybe [increase the log level](https://elasticsearch-py.readthedocs.io/en/v7.12.0/index.html?highlight=logging#logging) of your Python client to dump the error in your client code logs (see the sketch after these comments)? – Val Apr 29 '21 at 09:10
  • I will check it. I hope this Google Colab can help to find the problem: https://colab.research.google.com/drive/1eRvDeO73I_Xiap2X2HZOqgkgUGMwzs4m?usp=sharing – Armen Sanoyan Apr 29 '21 at 09:12
  • 1
    no I have just one node and I increased the log level to info still no results. – Armen Sanoyan Apr 29 '21 at 10:05
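
For reference, a minimal sketch of the client-side logging suggested above (elasticsearch-py logs requests on the elasticsearch logger and full request/response bodies on elasticsearch.trace):

import logging

logging.basicConfig(level=logging.DEBUG)
# Per-request status lines from the client.
logging.getLogger("elasticsearch").setLevel(logging.DEBUG)
# Full request and response bodies, useful to see the real 400 payload.
logging.getLogger("elasticsearch.trace").setLevel(logging.DEBUG)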

In my case the error was "Caused by: java.lang.ClassCastException: class org.elasticsearch.index.fielddata.ScriptDocValues$Doubles cannot be cast to class org.elasticsearch.xpack.vectors.query.VectorScriptDocValues$DenseVectorScriptDocValues".

My mistake was that I removed the ES index, the one that had the "type": "dense_vector" field, before starting to ingest content.

As a result, ES did not use the correct type for indexing the dense vectors: they were stored as useless lists of doubles. In this sense the ES index was 'corrupted': all 'script_score' queries returned 400.
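
A minimal sketch of the safe order, assuming the mapping from the question is stored in a dict called my_document_structure (name assumed): delete any stale index first, then create it with the explicit mapping, and only then ingest.

# Delete any leftover index first (ignore 404 if it doesn't exist yet)...
es.indices.delete(index=index, ignore=[404])
# ...then create it with the explicit dense_vector mapping, so documents
# indexed afterwards are not dynamically mapped as lists of doubles.
es.indices.create(index=index, body=my_document_structure)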

Vitaly

For me the issue was that I was using elastiknn_dense_float_vector instead of dense_vector, for which script_score queries are still an open issue. I am converting my vector index to use dense_vector instead: https://github.com/alexklibisz/elastiknn/issues/323
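
For illustration, a hedged sketch of what that conversion amounts to in the mapping (field name and dims taken from the question; the elastiknn-specific options are elided):

# Before (elastiknn plugin type, where the script_score query fails):
#   "abs_emb": {"type": "elastiknn_dense_float_vector", ...}
# After (built-in type that works with cosineSimilarity):
abs_emb_mapping = {
    "abs_emb": {
        "type": "dense_vector",
        "dims": 512
    }
}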

BEWARB