1

I am currently working on a project where I connect to an Elasticsearch server/database/cluster (whatever the technical term is), and my goal is to grab all the logs from the last 24 hours for parsing. I can grab logs right now, but it only returns a maximum of 10,000. For reference, within the last 24 hours there have been about 10 million logs total in the database I am using.

On the Python side, I make an HTTP request to Elasticsearch using the requests library. My current query only has the parameter size = 10,000.
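Roughly, the request looks like this (the host and index name are placeholders for my actual setup):

```python
import requests

# Placeholder host and index name -- my real setup differs.
ES_URL = "http://localhost:9200/my-logs/_search"

# Right now the body only sets `size`; Elasticsearch will not return
# more than 10,000 hits this way (the default index.max_result_window).
resp = requests.post(ES_URL, json={"size": 10000})
hits = resp.json()["hits"]["hits"]
```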

I am wondering what method/query to use for this case? I have seen things about a scroll ID or the point-in-time API, but I am not sure which is best for my case since there are so many logs.

I have tried just increasing the size to a lot more, but that does not work well since there are so many logs, and it errors out.

R Lyon
  • What are the available parameters for the Elasticsearch query, and which fields does the response have? I believe there might be pagination information stating how many logs match that request, so you can infer how many requests you need to make; therefore, using a `while` loop and exploring the Elasticsearch parameters you can achieve that. If you have a problem with performance, I suggest trying to implement parallel requests, but that should only be done after you have worked out the parameters and limitations of the server very well. – kovashikawa Jun 19 '23 at 14:16
  • I think the available parameters for the query are any of the queries you can do within the Elasticsearch curl, or I'm not exactly sure what you are asking; I'm new to this. I am thinking of keeping the size at 10,000, adding a query that limits the time range, and maybe using a while loop to keep fetching until there are no more. – R Lyon Jun 19 '23 at 14:23
  • First, please include in the question a simplified example of the response from your search, then try to explore the parameters for the query. The response probably includes a total number of logs for that search (L) and the number of logs in the response (n). By doing L/n you get the number of pages you need to request for that query (see the sketch just below these comments). See this thread for further info on this topic: https://stackoverflow.com/questions/59105657/what-is-the-best-approach-for-elasticsearch-pagination – kovashikawa Jun 19 '23 at 15:33
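(Sketch referenced above: where the total count and the per-page count live in a search response. The URL and index name are placeholders, and the response shape assumed is Elasticsearch 7+.)

```python
import requests

# Placeholder URL/index; track_total_hits asks for an exact total
# instead of the default cap of 10,000.
resp = requests.post("http://localhost:9200/my-logs/_search",
                     json={"size": 10000, "track_total_hits": True}).json()

total_logs = resp["hits"]["total"]["value"]  # L: total docs matching the query
page_size = len(resp["hits"]["hits"])        # n: docs returned in this page
if page_size:
    print("pages needed:", -(-total_logs // page_size))  # ceil(L / n)
```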

1 Answer

1

Use the Scroll API; it was designed for your use case.

The Scroll API is no longer encouraged for deep pagination; however, if you are running an internal (logging) application, the performance impact of scroll shouldn't be an issue, as you will not have many queries to serve.
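A minimal sketch using the requests library (the host, index name, and @timestamp field are assumptions; adjust them to your setup):

```python
import requests

ES = "http://localhost:9200"   # placeholder host
INDEX = "my-logs"              # placeholder index name

# Open a scroll context: the first request is a normal search with ?scroll=<keep-alive>.
body = {
    "size": 10000,
    "query": {"range": {"@timestamp": {"gte": "now-24h"}}},  # assumes an @timestamp field
}
resp = requests.post(f"{ES}/{INDEX}/_search", params={"scroll": "2m"}, json=body).json()

scroll_id = resp["_scroll_id"]
hits = resp["hits"]["hits"]

while hits:
    # Process this batch of up to 10,000 documents.
    for doc in hits:
        pass  # e.g. parse doc["_source"]

    # Fetch the next batch with the scroll ID.
    resp = requests.post(f"{ES}/_search/scroll",
                         json={"scroll": "2m", "scroll_id": scroll_id}).json()
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]

# Free the scroll context when done.
requests.delete(f"{ES}/_search/scroll", json={"scroll_id": scroll_id})
```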

Many Elasticsearch deployments focused on logging use Index Lifecycle Management (ILM) policies to create a new index each day (e.g. my-logs-2023-06-20), and the logs are ingested into that index automatically. Once the day is over, the index is made read-only, and you can automatically migrate it to colder tiers with reduced storage cost.

Here's an example ILM policy you may want to consider.
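As a rough sketch of what such a policy might look like, created through the ILM API (the policy name, tiers, and retention periods here are just illustrative placeholders):

```python
import requests

ES = "http://localhost:9200"  # placeholder host

# Illustrative policy only: roll over daily, lower priority after a week,
# delete after 30 days.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_age": "1d"}}
            },
            "warm": {
                "min_age": "7d",
                "actions": {"set_priority": {"priority": 50}}
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}}
            },
        }
    }
}

requests.put(f"{ES}/_ilm/policy/my-logs-policy", json=policy)
```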

If hundreds of indices sounds like a nightmare, don't worry: you can create an alias (e.g. all-my-logs) so you can query all the indices at once.
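For instance, something along these lines (the index pattern and alias name are assumptions):

```python
import requests

# Point an alias at every daily log index so one name covers them all.
requests.post("http://localhost:9200/_aliases", json={
    "actions": [
        {"add": {"index": "my-logs-*", "alias": "all-my-logs"}}
    ]
})

# Queries against the alias now search every matching index:
# POST /all-my-logs/_search
```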

fucalost