I have a Spark application with records which contain the following information:
- Hash - Some unique identifier for an item
- Location - The location of the item
- From - The date on which the item was first seen at that location
- To - Null if the item is still there, or the date on which it ceased to be at that location
I only need to be able to ask this one question:
Where was item X at datetime Y?
What is the most efficient way of indexing this information for very fast lookups? Let's say I have tens of billions of records per day that contain item hashes, and I need to enrich those records with their locations.
My simplistic approach would be to store the records as above, partitioned by hash (though there are probably about 10 million distinct hashes), and to join against my larger data source on matching hash where date > From and (To is null or date < To). This date-range comparison feels like it should have a better solution, though.
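To make the date comparison concrete, this is the per-item lookup I'm effectively trying to express, sketched in plain Python outside Spark (the hash, dates, and location names are made up; per hash the location intervals don't overlap, so sorting by From allows a binary search instead of a scan):

```python
from bisect import bisect_right
from datetime import datetime

# Illustrative index: hash -> list of (From, To, location), sorted by From.
# To is None while the item is still at that location.
index = {
    "abc123": [
        (datetime(2024, 1, 1), datetime(2024, 3, 1), "warehouse-A"),
        (datetime(2024, 3, 1), datetime(2024, 6, 15), "warehouse-B"),
        (datetime(2024, 6, 15), None, "store-7"),
    ],
}

def location_at(item_hash, ts):
    """Return the location of item_hash at time ts, or None if unknown."""
    intervals = index.get(item_hash, [])
    # Find the last interval whose From <= ts.
    i = bisect_right([iv[0] for iv in intervals], ts) - 1
    if i < 0:
        return None
    frm, to, loc = intervals[i]
    # Inside the interval if it is still open (To is None) or ts < To.
    if to is None or ts < to:
        return loc
    return None

print(location_at("abc123", datetime(2024, 2, 1)))  # warehouse-A
print(location_at("abc123", datetime(2024, 7, 1)))  # store-7
```

The join I described above is the relational equivalent of this lookup, which is why I suspect there is a smarter storage layout than a full range comparison per row.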
Any suggestions on how this data could be stored in a better, more efficient way?