I have an index with around 400 million documents stored. The fields generally look like this: systemname, filename, timestamp, message, version, etc.
What I want to get is the first and last document (based on the timestamp) per filename.
What I am doing right now is (see the sketch after the list):
Query the whole index
Define a window partitioned by filename
Sort ascending by timestamp
Add a row number
Add the maximum row number per window
Keep all entries where rownumber = 1 or rownumber = max_rownumber
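Roughly, the Spark side looks like this (a minimal sketch rather than my exact job; the index/type name and connection settings are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, row_number}

val spark = SparkSession.builder().appName("first-last-per-filename").getOrCreate()

// Read the whole index via the ES Spark connector
// ("myindex/mytype" stands in for the real index/type).
val docs = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("myindex/mytype")

// Window per filename, ordered by timestamp ascending.
val w = Window.partitionBy("filename").orderBy(col("timestamp").asc)
// Unbounded window to compute the max row number per filename.
val wAll = Window.partitionBy("filename")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// Keep only the first and last document of each window.
val firstAndLast = docs
  .withColumn("rownumber", row_number().over(w))
  .withColumn("max_rownumber", max(col("rownumber")).over(wAll))
  .where(col("rownumber") === 1 || col("rownumber") === col("max_rownumber"))
  .drop("rownumber", "max_rownumber")
```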
This is extremely slow.
Code can be seen here: Getting first and last entry of window
ES Version: 6.0.0
ES Spark Connector: elasticsearch-spark-20 6.0.0
Scala Version: 2.11.8
Spark Version: 2.2
Does anyone have an idea how I could speed this up?
The best approach would probably be to reduce the I/O, but Elasticsearch does not support aggregations through the Spark connector. I also found no way to query only specific fields of a document; for this example, reading only 5 of the 10 fields per document would be enough.
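To illustrate what I mean, something like the following is what I am after — a sketch only, assuming the connector's documented es.read.field.include setting and Spark SQL column pruning actually reduce what is fetched, which I could not confirm on my setup:

```scala
// Reusing the same SparkSession as above. The goal (hypothetical in my
// setup): read only the fields I actually need, so less data leaves ES.
val slim = spark.read
  .format("org.elasticsearch.spark.sql")
  // es.read.field.include appears in the connector docs, but I could not
  // confirm that it cuts down the data actually transferred from ES.
  .option("es.read.field.include", "systemname,filename,timestamp,message,version")
  .load("myindex/mytype")
  // As far as I understand, selecting columns is supposed to be pushed
  // down as _source filtering when pushdown is enabled.
  .select("filename", "timestamp")
```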
This calculation is taking over 15 hours on a cluster with 112 cores.
If someone could help me find a solution, that would be great. I am out of options and ideas. Not being able to run aggregations is starting to look like a dealbreaker for Elasticsearch, but I don't want to switch away from it.
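For reference, the aggregation I would run directly against Elasticsearch (outside Spark) to get the first and last document per filename would be something along these lines — a sketch using terms plus top_hits, untested against my mapping and assuming filename is a keyword field:

```scala
// Request body one would POST to /myindex/_search (e.g. via curl or a
// REST client), if only it could be pushed down from Spark. "myindex" is
// a placeholder, and the terms size would need tuning for ~400M docs.
val firstLastAgg: String =
  """{
    |  "size": 0,
    |  "aggs": {
    |    "per_file": {
    |      "terms": { "field": "filename", "size": 10000 },
    |      "aggs": {
    |        "first_doc": {
    |          "top_hits": { "size": 1, "sort": [ { "timestamp": "asc" } ] }
    |        },
    |        "last_doc": {
    |          "top_hits": { "size": 1, "sort": [ { "timestamp": "desc" } ] }
    |        }
    |      }
    |    }
    |  }
    |}""".stripMargin
```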