I am indexing data in Elasticsearch using the official Python library, elasticsearch-py. The data is read directly from Oracle using the cx_Oracle Python library, converted into documents, and sent to Elasticsearch for indexing. For the most part this works well, but sometimes I run into problems with characters like ö. Sometimes this character is indexed as \xc3\xb8 and sometimes as ö. This happens even within the same database entry: one field can have the ö indexed correctly while another field does not.

Does anyone have an idea what might cause this?

Thanks in advance.

Sleenee

1 Answer

If your "ö" is sometimes right and sometimes not, the data must be corrupted in your database. This is not an Elasticsearch problem. (I had the exact same problem a month ago!)

Strings in various encodings were probably inserted into your database without first being converted to a single encoding.

text = "ö"
asUtf=text.encode('UTF-8')
print(asUtf)
print(asUtf.decode())

Result:

b'\xc3\xb6'

ö

This problem can be solved before insertion into Elasticsearch. Find the text sequences matching '\xXX\xXX', treat them as UTF-8 bytes, and decode them to Unicode. Also try to sanitize your database and fix the way data is written to it.
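A minimal sketch of that clean-up step, assuming the broken values contain the escape sequences as literal text (a backslash, an `x`, and two hex digits) rather than as real bytes. The helper name `decode_escaped_utf8` and the regular expression are illustrative only and not part of any library:

```python
import re

# Runs of two or more literal "\xHH" escapes, e.g. the eight characters \xc3\xb6.
ESCAPED_BYTES = re.compile(r'(?:\\x[0-9a-fA-F]{2}){2,}')


def decode_escaped_utf8(text):
    """Replace literal \\xHH runs with the characters they encode in UTF-8."""

    def _replace(match):
        # "\xc3\xb6" -> "c3b6" -> b'\xc3\xb6'
        raw = bytes.fromhex(match.group(0).replace('\\x', ''))
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8: leave the original text untouched.
            return match.group(0)

    return ESCAPED_BYTES.sub(_replace, text)


print(decode_escaped_utf8('K\\xc3\\xb6ln'))
```

Strings that are already correct pass through unchanged, so the function can be applied to every text field before the document is sent for indexing.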

PS: A better practice for moving data from a database to Elasticsearch is to use rivers, or to write a script that sends the data directly to Elasticsearch without saving it to a file first.

2016 edit: rivers are now deprecated, so you should use an alternative such as Logstash.
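As a rough sketch of the script approach, the generator below turns database rows into bulk-indexing actions and normalizes any byte values to text on the way. The function name `rows_to_actions`, the column names, and the index name are hypothetical, and the sketch assumes that byte values are UTF-8 encoded:

```python
def rows_to_actions(rows, columns, index_name):
    """Yield one bulk action per database row.

    rows: iterable of tuples, e.g. as returned by a database cursor.
    columns: column names, in the same order as the values in each row.
    index_name: target index (hypothetical name chosen by the caller).
    """
    for row in rows:
        doc = {}
        for name, value in zip(columns, row):
            # Assumption: raw bytes coming from the database are UTF-8.
            if isinstance(value, bytes):
                value = value.decode('utf-8')
            doc[name] = value
        yield {"_index": index_name, "_source": doc}


# Example with in-memory rows standing in for a real cursor.
sample_rows = [(1, b'K\xc3\xb6ln'), (2, 'Malmö')]
for action in rows_to_actions(sample_rows, ['id', 'city'], 'places'):
    print(action)
```

With elasticsearch-py installed, the resulting generator can be passed to `elasticsearch.helpers.bulk(client, actions)`, so the data never has to be written to an intermediate file.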

Heschoon
  • On clicking [rivers](https://www.elastic.co/guide/en/elasticsearch/rivers/current/index.html), it says **Rivers were deprecated in Elasticsearch 1.5 and removed in Elasticsearch 2.0.** Just out of curiosity, any other way to send data directly to ES without a script? – Sameer Mirji Feb 02 '16 at 05:36
  • Since rivers are deprecated, you'll have to use an alternative like Logstash. Have a look at this: http://stackoverflow.com/questions/29674974/alternatives-to-elasticsearch-river-plugins/ – Heschoon Feb 02 '16 at 10:58