
I have an index with around 400 million documents stored. The fields generally look like this: systemname, filename, timestamp, message, version, etc.

What I want to get is the first and last document (based on the timestamp) per filename.

What I am doing right now is:

1. Query the whole index
2. Define a window based on filename
3. Sort ascending by timestamp
4. Add a rownumber
5. Add a max_rownumber
6. Filter all entries where rownumber = 1 or rownumber = max_rownumber
This is really, really slow.

Code can be seen here: Getting first and last entry of window
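In case the link goes stale, here is a minimal sketch of those steps (the index/type name `myindex/mytype` is a placeholder, and this is a reconstruction of the steps above rather than the exact linked code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, row_number}

val spark = SparkSession.builder().appName("FirstLastPerFilename").getOrCreate()

// 1. Query the whole index through the connector.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("myindex/mytype")

// 2./3. Window per filename, sorted ascending by timestamp.
val ordered = Window.partitionBy("filename").orderBy(col("timestamp").asc)
// Unordered window over the same partition (its frame spans the whole
// partition), used to find the highest rownumber per filename.
val whole = Window.partitionBy("filename")

// 4./5./6. Number the rows, compute the max per window, keep first and last.
val firstLast = df
  .withColumn("rownumber", row_number().over(ordered))
  .withColumn("max_rownumber", max(col("rownumber")).over(whole))
  .filter(col("rownumber") === 1 || col("rownumber") === col("max_rownumber"))
```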

- ES Version: 6.0.0
- ES Spark Connector: elasticsearch-spark-20 6.0.0
- Scala Version: 2.11.8
- Spark Version: 2.2

Does someone have any idea how I could speed this up?
The best approach would probably be to reduce the I/O, but the elasticsearch-spark connector does not support aggregations. I also found no way to query only specific fields of a document; in this example, reading only 5 of the 10 fields per document would be enough. The calculation currently takes over 15 hours on a cluster with 112 cores.
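For illustration, this is the kind of narrowed read I am after. The `es.read.field.include` setting below comes from the elasticsearch-hadoop configuration docs; I am assuming it applies to this connector version and reuses the `spark` session from the sketch above:

```scala
// Hypothetical narrowed read: ask the connector to return only the five
// fields this job needs (setting name per the elasticsearch-hadoop docs).
val narrow = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.include", "systemname,filename,timestamp,message,version")
  .load("myindex/mytype")
```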

If someone could help me with a solution, that would be really nice. I am out of options and ideas. Not being able to query aggregations is becoming a dealbreaker for elasticsearch, but I don't want to switch away.

user2811630
  • Possible duplicate of https://stackoverflow.com/questions/29169612/elasticsearch-get-docs-with-max-and-min-timestamp-values – NilsH Jan 22 '18 at 10:08
  • @NilsH This is not really a duplicate, since that question refers only to elasticsearch itself, where you can use aggregations. As I wrote in this question, the spark-elasticsearch connector does not provide aggregations. What I am looking for is another way to accomplish this result. – user2811630 Jan 22 '18 at 11:32
  • Apologies. I missed that part. And I am also surprised that ES spark does not have that option... – NilsH Jan 22 '18 at 11:56
  • Sad but true :( Seems like I will not be using elasticsearch for long. – user2811630 Jan 22 '18 at 15:08
