I believe a common usage pattern for Hadoop is to build a "data lake" by loading regular (e.g. daily) snapshots of data from operational systems. For many systems, the rate of change from day to day is typically less than 5% of rows (and even when a row is updated, only a few fields may change).
Q: How can such historical data be structured on HDFS, so that it is both economical in space consumption, and efficient to access.
Of course, the answer will depend on how the data is commonly accessed. On our Hadoop cluster:
- Most jobs only read and process the most recent version of the data
- A few jobs process a period of historical data (e.g. 1 - 3 months)
- A few jobs process all available historical data
This implies that, while keeping historical data is important, it shouldn't come at the cost of severely slowing down those jobs that only want to know what the data looked like at close-of-business yesterday.
I know of a few options, none of which seem quite satisfactory:
Store each full dump independently as a new subdirectory. This is the most obvious design, simple, and very compatible with the MapReduce paradigm. I'm sure some people use this approach, but I have to wonder how they justify the cost of storage? Supposing 1Tb is loaded each day, then that's 365Tb added to the cluster per year of mostly duplicated data. I know disks are cheap these days, but most budget-makers are accustomed to infrastructure expanding proportional to business growth, as opposed to growing linearly over time.
Store only the differences (delta) from the previous day. This is a natural choice when the source systems prefer to send updates in the form of deltas (a mindset which seems to date from the time when data was passed between systems in the form of CD-ROMs). It is more space efficient, but harder to get right (for example, how do you represent deletion?), and even worse it implies the need for consumers to scan the whole of history, "event sourcing"-style, in order to arrive at the current state of the system.
Store each version of a row once, with a start and end date. Known by terms such as "time variant data", this pattern pops up very frequently in data warehousing, and more generally in relational database design when there is a need to store historical values. When a row changes, update the previous version to set the "end date", then insert the new version with today as the "start date". Unfortunately, this doesn't translate well to the Hadoop paradigm, where append-only datasets are favoured, and there is no native concept of updating a row (although that effect can be achieved by overwriting the existing data files). This approach requires quite complicated logic to load the data, but admittedly it can be quite convenient to consume data with this structure.
(It's worth noting that all it takes is one particularly volatile field changing every day to make the latter options degrade to the same space efficiency as option 1).
So...is there another option that combines space efficiency with ease of use?