56

How can I get all the results from Elasticsearch? The results are limited to 10 by default. I have a query like:

@data = Athlete.search :load => true do
  size 15
  query do
    boolean do
      must { string q, {:fields => ["name", "other_names", "nickname", "short_name"], :phrase_slop => 5} }
      unless conditions.blank?
        conditions.each do |condition|
          must { eval(condition) }
        end
      end
      unless excludes.blank?
        excludes.each do |exclude|
          must_not { eval(exclude) }
        end
      end
    end
  end
  sort do
    by '_score', "desc"
  end
end

I have set the limit to 15, but I want to make it unlimited so that I can get all the data. I can't hard-code a limit because my data keeps changing, and I want all of it.

Sumit Rai

5 Answers

36

You can use the from and size parameters to page through all your data. This could be very slow depending on your data and how much is in the index.

http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
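A minimal sketch of this from/size paging with the official Python client (elasticsearch-py); the host, index name, and page size here are assumptions, not part of the original answer:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
page_size = 100  # assumed page size
start = 0
all_hits = []

while True:
    response = es.search(
        index="athletes",  # hypothetical index name
        body={
            "query": {"match_all": {}},
            "from": start,
            "size": page_size,
        },
    )
    hits = response["hits"]["hits"]
    if not hits:
        break  # no more pages
    all_hits.extend(hits)
    start += page_size  # advance the window one page

# Note: from + size cannot exceed index.max_result_window (10,000 by default).
print(len(all_hits))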

Vamsi Krishna
Zach
  • If you *really* want everything, then "If set to 0, the size will be set to Integer.MAX_VALUE.", but pagination is definitely the correct solution. I've seen clusters fall over due to requesting too much in one query. – Wilfred Hughes Jul 08 '14 at 08:51
  • "'size': 0" didn't work for me, and using high integer values (for instance PHP's max) may result in a fatal error. Pagination is definitely the best choice. – Ecter Mar 06 '15 at 08:59
  • Yeah, `size: 0` will literally return zero results. It is not equivalent to asking for `Integer.MAX_VALUE`. That convention is used [in a few other places](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_size) though, which might be where the confusion comes from. – Zach Mar 06 '15 at 21:41
  • Agreed, `size: 0` is useful when you do metrics or buckets and want to count how many things are in a bucket without returning the actual documents. – travelingbones Apr 28 '16 at 22:26
13

Another approach is to first do a search with searchType: 'count', and then do a normal search with size set to results.count.

The advantage here is that it avoids depending on a magic number for UPPER_BOUND, as suggested in this similar SO question, and avoids the extra overhead of building too large a priority queue, which Shay Banon describes here. It also lets you keep your results sorted, unlike scan.

The biggest disadvantage is that it requires two requests. Depending on your circumstance, this may be acceptable.
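A rough sketch of the two-request approach, assuming the official Python client and using the modern count API in place of the old searchType: 'count'; the host, index name, and query are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
query = {"match_all": {}}  # hypothetical query

# Request 1: count the matching documents.
total = es.count(index="athletes", body={"query": query})["count"]

# Request 2: fetch exactly that many, sorted by relevance.
response = es.search(
    index="athletes",
    body={"query": query, "size": total, "sort": ["_score"]},
)
hits = response["hits"]["hits"]

# Beware: documents indexed between the two requests can be missed, and the
# second request still fails if total exceeds index.max_result_window.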

David
10

From the docs: "Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000". So my admittedly very ad-hoc solution is to just pass size: 10000, or 10,000 minus from if I use the from argument.
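As a sketch of that workaround (assuming the official Python client and a hypothetical index name):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
response = es.search(
    index="athletes",  # hypothetical index name
    body={"query": {"match_all": {}}, "size": 10000},
)
hits = response["hits"]["hits"]  # capped by index.max_result_window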

Note that, following Matt's comment below, the proper way to do this if you have a larger number of documents is to use the scroll API. I have used this successfully, but only with the Python interface.
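For reference, a minimal sketch of what that can look like with the official Python client's helpers.scan wrapper, which pages through the scroll API for you; the host, index name, and query are assumptions:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

# helpers.scan yields every matching document, page by page, without
# hitting the max_result_window limit. Results are unsorted by default.
for doc in helpers.scan(
    es,
    index="athletes",  # hypothetical index name
    query={"query": {"match_all": {}}},
    scroll="5m",  # keep the scroll context alive between pages
):
    print(doc["_source"])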

travelingbones
  • It is not the window size that cannot exceed 10,000; it's from + size that cannot exceed 10,000. Therefore, if you had the case where from = 8000 and size = 5000, it would fail even though your window size is less than 10,000 by a significant amount. – Matt S Jul 06 '17 at 17:26
  • Is the consequence of your comment that my answer should be edited to delete the "or 10,000 minus from ..." to be correct? – travelingbones Jul 10 '17 at 20:43
  • Well, the question is asking how he can get all the data, so your answer doesn't really work here even if you remove your subtraction suggestion. The only answer to the question is to use scrolling, if you want to post about that. – Matt S Jul 10 '17 at 22:01
7

Use the scan method, e.g.:

 curl -XGET 'localhost:9200/_search?search_type=scan&scroll=10m&size=50' -d '
 {
    "query" : {
       "match_all" : {}
    }
 }'

see here

Rachel Gallen
  • I can't use the scan method as I need to do sorting for the above. Is there any other way I can just collect all the required data? – Sumit Rai Jan 18 '13 at 10:28
  • Why do you need ALL the results sorted by relevance? You'll notice that search engines like Google don't return more than 1,000 results. That's because the deeper you go, the more work each page requires, especially in a distributed environment. – DrTech Jan 19 '13 at 09:54
  • @DrTech Which would be the right way to approach updating records having the value X in the Y field? (Could be more than 100k documents.) – Marvin Saldinger Aug 29 '14 at 13:59
  • Note that scan has been deprecated in 2.1: https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_21_search_changes.html#breaking_21_search_changes – Christophe Roussy Jul 12 '16 at 10:29
  • @ChristopheRoussy Answered 3 years ago! But thanks, noted. – Rachel Gallen Jul 12 '16 at 10:40
  • All scan links are dead... Elasticsearch just wants to forget about it. – Eric Hodonsky Dec 22 '17 at 22:48
0

You can use search_after to paginate, and the Point in Time API to avoid having your data change while you paginate. Example with elasticsearch-dsl for Python:

from typing import Any, List

from elasticsearch_dsl import Search
from elasticsearch_dsl.connections import connections
from elasticsearch_dsl.utils import AttrDict

# Set up a paginated query with search_after and a fixed point in time.
# elastic_host, MY_INDEX, and filter_ are placeholders for your own host,
# index, and filter query.
elasticsearch = connections.create_connection(hosts=[elastic_host])
pit = elasticsearch.open_point_in_time(index=MY_INDEX, keep_alive="3m")
pit_id = pit["id"]

query_size = 500
search_after = [0]
hits: List[AttrDict] = []
while query_size:
    if hits:
        # Resume from the sort values of the last hit on the previous page.
        search_after = hits[-1].meta.sort

    search = (
        Search()
        .extra(size=query_size)
        .extra(pit={"id": pit_id, "keep_alive": "5m"})
        .extra(search_after=search_after)
        .filter(filter_)
        .sort("url.keyword")  # Note you need a unique field to sort on or it may never advance
    )
    response = search.execute()
    hits = [hit for hit in response]

    # The PIT id can change between requests, so always carry the latest one.
    pit_id = response.pit_id
    query_size = len(hits)
    for hit in hits:
        pass  # Do work with each hit here

# Release the point in time when finished.
elasticsearch.close_point_in_time(body={"id": pit_id})
Noumenon