I have a Spark application with records which contain the following information:
- Hash - Some unique identifier for an item
- Location - The location of the item
- From - The date on which the item was first seen at that location
- To - Null if the item is still there, or the date on which it ceased to be at that location
I only need to be able to ask this one question:
Where was item X at datetime Y?
What is the most efficient way of indexing this information for very fast lookups? Let's say I have tens of billions of records per day that contain item hashes, and I need to enrich those records with their locations.
My simplistic approach would be to store the records as above, partitioned by hash (though there are probably about 10 million distinct hashes), and to join against my larger data source on matching hash where date > From and (To is null or date < To). This date-range comparison feels like it should have a better solution, though.
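To make the date comparison concrete, this is the per-item lookup I'm effectively trying to express, sketched in plain Python outside Spark (the hash, dates, and location names are made up; per hash the location intervals don't overlap, so sorting by From allows a binary search instead of a scan):

```python
from bisect import bisect_right
from datetime import datetime

# Illustrative index: hash -> list of (From, To, location), sorted by From.
# To is None while the item is still at that location.
index = {
    "abc123": [
        (datetime(2024, 1, 1), datetime(2024, 3, 1), "warehouse-A"),
        (datetime(2024, 3, 1), datetime(2024, 6, 15), "warehouse-B"),
        (datetime(2024, 6, 15), None, "store-7"),
    ],
}

def location_at(item_hash, ts):
    """Return the location of item_hash at time ts, or None if unknown."""
    intervals = index.get(item_hash, [])
    # Find the last interval whose From <= ts.
    i = bisect_right([iv[0] for iv in intervals], ts) - 1
    if i < 0:
        return None
    frm, to, loc = intervals[i]
    # Inside the interval if it is still open (To is None) or ts < To.
    if to is None or ts < to:
        return loc
    return None

print(location_at("abc123", datetime(2024, 2, 1)))  # warehouse-A
print(location_at("abc123", datetime(2024, 7, 1)))  # store-7
```

The join I described above is the relational equivalent of this lookup, which is why I suspect there is a smarter storage layout than a full range comparison per row.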
Any suggestions on how this data could be stored in a better, more efficient way?