8

I am having a very bad week after choosing Elasticsearch with Graylog2. I am trying to run queries against the data in ES using Python.

I have tried the following clients.

  1. ESClient: very weird results. I think it's not maintained; query_body has no effect and it returns all the results.
  2. pyes: unreadable and undocumented. I have browsed the sources and can't figure out how to run a simple query; maybe I am just not that smart. I would be happy to just run raw queries in JSON format and then use the Python objects/iterators to do my analysis on the results, but pyes does not make it easy.
  3. ElasticUtils: documented, but without a complete sample. I get the following error with the code below. I don't even understand how it uses this S() to connect to the right host.

    es = get_es(hosts=HOST, default_indexes=[INDEX])

    basic_s = S().indexes(INDEX).doctypes(DOCTYPE).values_dict()

The result:

 print basic_s.query(message__text="login/delete")
  File "/usr/lib/python2.7/site-packages/elasticutils/__init__.py", line 223, in __repr__
    data = list(self)[:REPR_OUTPUT_SIZE + 1]
  File "/usr/lib/python2.7/site-packages/elasticutils/__init__.py", line 623, in __iter__
    return iter(self._do_search())
  File "/usr/lib/python2.7/site-packages/elasticutils/__init__.py", line 573, in _do_search
    hits = self.raw()
  File "/usr/lib/python2.7/site-packages/elasticutils/__init__.py", line 615, in raw
    hits = es.search(qs, self.get_indexes(), self.get_doctypes())
  File "/usr/lib/python2.7/site-packages/pyes/es.py", line 841, in search
    return self._query_call("_search", body, indexes, doc_types, **query_params)
  File "/usr/lib/python2.7/site-packages/pyes/es.py", line 251, in _query_call
    response = self._send_request('GET', path, body, querystring_args)
  File "/usr/lib/python2.7/site-packages/pyes/es.py", line 208, in _send_request
    response = self.connection.execute(request)
  File "/usr/lib/python2.7/site-packages/pyes/connection_http.py", line 167, in _client_call
    return getattr(conn.client, attr)(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/pyes/connection_http.py", line 59, in execute
    response = self.client.urlopen(Method._VALUES_TO_NAMES[request.method], uri, body=request.body, headers=request.headers)
  File "/usr/lib/python2.7/site-packages/pyes/urllib3/connectionpool.py", line 294, in urlopen
    return self.urlopen(method, url, body, headers, retries-1, redirect) # Try again
  File "/usr/lib/python2.7/site-packages/pyes/urllib3/connectionpool.py", line 294, in urlopen
    return self.urlopen(method, url, body, headers, retries-1, redirect) # Try again
  File "/usr/lib/python2.7/site-packages/pyes/urllib3/connectionpool.py", line 294, in urlopen
    return self.urlopen(method, url, body, headers, retries-1, redirect) # Try again
  File "/usr/lib/python2.7/site-packages/pyes/urllib3/connectionpool.py", line 294, in urlopen
    return self.urlopen(method, url, body, headers, retries-1, redirect) # Try again
  File "/usr/lib/python2.7/site-packages/pyes/urllib3/connectionpool.py", line 255, in urlopen
    raise MaxRetryError("Max retries exceeded for url: %s" % url)
pyes.urllib3.connectionpool.MaxRetryError: Max retries exceeded for url: /graylog2/message/_search

I wish the devs of these good projects would provide some complete examples. Even looking at the sources, I am at a complete loss.

Is there any solution or help out there for me with Elasticsearch and Python, or should I just drop all of this, pay for a nice Splunk account, and end this misery?

For now I am proceeding with curl: download the entire JSON result and load it with json. I hope that works, though having curl download a million messages from Elasticsearch may just not happen.
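
Something like this is what I have in mind (a sketch; the host is an assumption, the index/type are the ones from the traceback below):

    import json
    import urllib2

    # Fetch the whole result set in one request and parse it locally.
    # 'login/delete' is just the query I have been testing with.
    url = 'http://localhost:9200/graylog2/message/_search?size=1000000'
    body = json.dumps({'query': {'query_string': {'query': 'login/delete'}}})
    results = json.loads(urllib2.urlopen(urllib2.Request(url, body)).read())
    print results['hits']['total']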

Abhishek Dujari
  • I agree... I am having a tough time trying to get pyes to work, with very little support. I don't think it is right for me to blame the developers; I guess ES as a whole is new and I just have to have more patience :) – Abhi Aug 22 '12 at 21:29
  • I am not blaming anyone here. I just find that the docs are lacking and it's hard to contribute. – Abhishek Dujari Aug 27 '12 at 12:29
  • Why on earth would you want a library when the REST API is so well documented? – Slater Victoroff Jun 27 '13 at 18:53

6 Answers

8

I have found rawes to be quite usable: https://github.com/humangeo/rawes

It's a rather low-level interface but I have found it to be much less awkward to work with than the high-level ones. It also supports the Thrift RPC if you're into that.
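
For example, a basic query looks something like this (going by the rawes README; the index/type are carried over from the question):

    import rawes

    # Plain HTTP transport; rawes can also talk Thrift if that's enabled.
    es = rawes.Elastic('localhost:9200')

    # The path is just the REST endpoint; the body is a plain dict.
    result = es.get('graylog2/message/_search', data={
        'query': {'query_string': {'query': 'login/delete'}},
    })
    print result['hits']['total']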

Lars Hansson
7

Honestly, I've had the most luck with just curling everything. ES has so many different methods, filters, and queries that the various wrappers have a hard time recreating all the functionality. In my view, it is similar to using an ORM for a database: what you gain in ease of use you lose in flexibility/raw power.

Except most of the wrappers for ES aren't really that easy to use.

I'd give curl a try for a while and see how that treats you. You can use external JSON formatters to validate your JSON and search the mailing list for examples, and the docs are fine as long as you work in JSON.
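
For example, the query from the question boils down to a single request (host assumed; index/type taken from the question's traceback):

    curl -XGET 'http://localhost:9200/graylog2/message/_search' -d '
    {
        "query": { "query_string": { "query": "login/delete" } }
    }'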

Zach
  • Yes, I am going that route now, but how do I get ES to give me the complete result set (all records)? http://stackoverflow.com/questions/8829468/elastic-search-query-to-return-all-records might work. Thanks for your input. – Abhishek Dujari Aug 04 '12 at 02:46
  • So you just want all the results, regardless of any query or scoring? The scan/scroll method described in that post would work for you. My dataset is about 200,000 documents and I can scan/scroll through it very quickly. – Zach Aug 04 '12 at 13:50
  • My data set is 50 million documents, and 3 million are added every day. I don't want all the docs. I do have a simple text-search query to get what I need; this is to reduce it to the minimum possible important data. I am going to start on this again today and post my results. – Abhishek Dujari Aug 06 '12 at 16:40
  • Scan/scroll can still work, but realize no documents will be scored. You'll be able to iterate over all the results of a query, but you won't know which order the results would normally be sorted in. If that's fine with you, then scan/scroll will be much faster than performing a normal query. Failing that, you can use a normal query and paginate through it using "from" and "size" – Zach Aug 07 '12 at 14:21
  • Yes, I just need the results; sorting and querying can be done outside of ES. The problem I have right now is that Elasticsearch just cannot handle this much data when using scan/scroll; it seems to get choked and not accept incoming data while doing the scroll. This is far worse than MySQL, honestly. – Abhishek Dujari Aug 07 '12 at 18:05
  • Ah, interesting. I'd ask on the google groups for ES. Shay Banon, the creator of ES, is very active and would probably have a good answer (or fix) for this problem – Zach Aug 07 '12 at 22:34
  • I was able to resolve it using pycurl and toning down my scroll to a smaller number of objects per shard, like 500, and that kind of helped. Anything higher causes ES to stop taking any input data from our log servers. So far things are working well (see the sketch after these comments). – Abhishek Dujari Aug 27 '12 at 12:31
  • Agree that direct `curl`ing makes sense here. The only thing you miss there is robust failover among ES hosts. That's the primary feature I'm looking for in a (thin) ES client library. – ron rothman Sep 26 '13 at 13:42
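
A sketch of the scan/scroll approach discussed in these comments, using the legacy scan search type (host assumed; index/type from the question; size=500 per shard is the figure that worked above):

    # Open a scan-type scroll: results are unscored, 500 docs per shard per batch.
    curl -XGET 'http://localhost:9200/graylog2/message/_search?search_type=scan&scroll=5m&size=500' -d '
    {
        "query": { "query_string": { "query": "login/delete" } }
    }'
    # Each response carries a _scroll_id; feed it back to page through:
    curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m' -d '<_scroll_id from the previous response>'
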
7

Explicitly setting the host resolved that error for me:

basic_s = S().es(hosts=HOST, default_indexes=[INDEX])
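
Presumably this replaces the bare S() in the chain from the question, along these lines (a sketch, not tested):

    # Hypothetical: merge the explicit connection settings into the
    # index/doctype chain from the question.
    basic_s = S().es(hosts=HOST).indexes(INDEX).doctypes(DOCTYPE).values_dict()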

Jamey Sharp
tentpole
4

FWIW, the pyes docs are here: http://packages.python.org/pyes/index.html

Usage: http://packages.python.org/pyes/manual/usage.html
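
Going by those usage docs, a basic search looks roughly like this (the search signature moved around between pyes versions, so treat it as a sketch; the index/doctype are from the question):

    from pyes import ES

    conn = ES('127.0.0.1:9200')
    # pyes also accepts a raw query body as a dict, which avoids the
    # query-class API entirely:
    q = {'query': {'query_string': {'query': 'login/delete'}}}
    results = conn.search(q, indexes=['graylog2'], doc_types=['message'])
    for hit in results['hits']['hits']:
        print hit['_source']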

dan
3

ElasticSearch recently (Sept 2013) released an official Python client, elasticsearch-py (elasticsearch on PyPI, also on GitHub), which is supposed to be a fairly direct mapping to the official ElasticSearch API. I haven't used it yet, but it looks promising, and at least it will match the official docs!

Edit: We started using it, and I'm very happy with it. ElasticSearch's API is pretty clean, and elasticsearch-py maintains that. Easier to work with and debug in general, plus decent logging.
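
For instance, the query from the question translates pretty directly (host, index, and doctype are assumptions carried over from the question):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['localhost:9200'])
    res = es.search(index='graylog2', doc_type='message', body={
        'query': {'query_string': {'query': 'login/delete'}},
    })
    for hit in res['hits']['hits']:
        print hit['_source']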

Holly
2

ElasticUtils has sample code: http://elasticutils.readthedocs.org/en/latest/sampleprogram1.html

If there are other things you need in the docs, just ask.

  • Yes, I have tried that and posted my error above. I don't need to add an index etc.; I just need the result of a query using the method shown in the sample, but it fails as per my question. Any ideas on what I'm doing wrong? – Abhishek Dujari Aug 06 '12 at 19:45
  • MaxRetryError suggests it wasn't able to connect to the host you specified. So either HOST is set wrong or ElasticSearch isn't running. – user1580130 Aug 19 '12 at 23:56
  • Syntax might be the problem. I found that between versions 0.6 and 0.7 there were changes in how to specify the location of your server. Once it was just a host; now it is a URL (or similar). Maybe you've run into the same issue. I solved it by installing the newest version of ElasticUtils directly from GitHub. – Alfe May 09 '13 at 01:11