
I believe a common usage pattern for Hadoop is to build a "data lake" by loading regular (e.g. daily) snapshots of data from operational systems. For many systems, the rate of change from day to day is typically less than 5% of rows (and even when a row is updated, only a few fields may change).

Q: How can such historical data be structured on HDFS so that it is both economical in space consumption and efficient to access?

Of course, the answer will depend on how the data is commonly accessed. On our Hadoop cluster:

  • Most jobs only read and process the most recent version of the data
  • A few jobs process a period of historical data (e.g. 1 - 3 months)
  • A few jobs process all available historical data

This implies that, while keeping historical data is important, it shouldn't come at the cost of severely slowing down those jobs that only want to know what the data looked like at close-of-business yesterday.

I know of a few options, none of which seem quite satisfactory:

  1. Store each full dump independently as a new subdirectory. This is the most obvious design: simple, and very compatible with the MapReduce paradigm. I'm sure some people use this approach, but I have to wonder how they justify the cost of storage. Supposing 1TB is loaded each day, that's 365TB added to the cluster per year of mostly duplicated data. I know disks are cheap these days, but most budget-makers are accustomed to infrastructure expanding in proportion to business growth, as opposed to growing linearly over time.

  2. Store only the differences (delta) from the previous day. This is a natural choice when the source systems prefer to send updates in the form of deltas (a mindset which seems to date from the time when data was passed between systems in the form of CD-ROMs). It is more space efficient, but harder to get right (for example, how do you represent deletion?), and, even worse, it implies that consumers must scan the whole of history, "event sourcing"-style, in order to arrive at the current state of the system.

  3. Store each version of a row once, with a start and end date. Known by terms such as "time variant data", this pattern pops up very frequently in data warehousing, and more generally in relational database design when there is a need to store historical values. When a row changes, update the previous version to set the "end date", then insert the new version with today as the "start date". Unfortunately, this doesn't translate well to the Hadoop paradigm, where append-only datasets are favoured, and there is no native concept of updating a row (although that effect can be achieved by overwriting the existing data files). This approach requires quite complicated logic to load the data, but admittedly it can be quite convenient to consume data with this structure.
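For concreteness, a toy example of the row shape option 3 produces (the column names and values here are my own invention, not from any real system):

```python
# Two versions of the same business key; the open-ended end_date marks the
# currently valid row. Columns and dates are purely illustrative.
rows = [
    {"id": 42, "status": "active",  "start_date": "2016-01-01", "end_date": "2016-03-14"},
    {"id": 42, "status": "dormant", "start_date": "2016-03-15", "end_date": None},
]
```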

(It's worth noting that all it takes is one particularly volatile field changing every day to make the latter options degrade to the same space efficiency as option 1).

So...is there another option that combines space efficiency with ease of use?

Community
  • 1
  • 1
Todd Owen
  • 15,650
  • 7
  • 54
  • 52

1 Answer


I'd suggest a variant of option 3 that respects the append-only nature of HDFS.

Instead of one data set, we keep two with different kinds of information, stored separately:

  1. The history of expired rows, most likely partitioned by the end date (perhaps monthly). This only has rows added to it when their end dates become known.
  2. A collection of snapshots for particular days, including at least the most recent day, most likely partitioned by the snapshot date. New snapshots can be added each day, and old snapshots can be deleted after a couple of days since they can be reconstructed from the current snapshot and the history of expired records.

The difference from option 3 is just that we consider the unexpired rows to be a different kind of information from the expired ones.
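A minimal sketch of the daily load under this scheme, assuming Parquet files on HDFS, a single flat business schema, and entirely hypothetical paths and dates:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-load").getOrCreate()

load_date = "2016-02-09"   # today's extract (hypothetical)
prev_date = "2016-02-08"   # yesterday's snapshot (hypothetical)

today = spark.read.parquet("/data/source_extract/" + load_date)
yesterday = spark.read.parquet("/data/snapshots/snapshot_date=" + prev_date)

# Rows that existed yesterday but are absent or different today have expired;
# a whole-row anti-join stands in for a proper changed-row comparison.
expired = (yesterday
           .join(today, on=yesterday.columns, how="left_anti")
           .withColumn("end_date", F.lit(load_date)))

# 1. Append the newly expired rows to the history, partitioned by end date.
expired.write.mode("append").partitionBy("end_date").parquet("/data/history")

# 2. Write today's extract as a brand-new snapshot partition; older snapshot
#    partitions can be dropped once no running query still needs them.
(today.withColumn("snapshot_date", F.lit(load_date))
      .write.mode("append").partitionBy("snapshot_date")
      .parquet("/data/snapshots"))
```

Everything here is append-only: nothing already written to HDFS is modified in place.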

Pro: Consistent with the append-only nature of HDFS.

Pro: Queries using the current snapshot can run safely while a new day is added as long as we retain snapshots for a few days (longer than the longest query takes to run).

Pro: Queries using history can similarly run safely as long as they explicitly give a bound on the latest "end-date" that excludes any subsequent additions of expired rows while they are running.

Con: It is not just a simple "update" or "overwrite" each day. In practice on HDFS this generally needs to be implemented via copying and filtering anyway, so this isn't really a con.

Con: Many queries need to combine the two data sets. To ease this we can create views or similar that appropriately union the two to produce something that looks exactly like option 3.
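One way to build such a view (the paths and names are assumptions carried over from the sketch above, and the start date is ignored for brevity) is to give the chosen snapshot an open-ended end date and union it with the history:

```python
# Expired rows carry their real end_date (the partition column on /data/history);
# live rows from the chosen snapshot get a sentinel "not yet ended" value.
history = (spark.read.parquet("/data/history")
                .withColumn("end_date", F.col("end_date").cast("string")))
current = (spark.read.parquet("/data/snapshots/snapshot_date=" + load_date)
                .withColumn("end_date", F.lit("9999-12-31")))

full_history = history.unionByName(current)
full_history.createOrReplaceTempView("rows_with_end_dates")  # hypothetical name
```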

Con: Finding the latest snapshot requires finding the right partition. This can be eased by having a view that "rolls over" to the latest snapshot each time a new one is available.
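A sketch of that "roll over", again under the same hypothetical layout: resolve the newest snapshot partition once and expose it under a stable name, so consumers never hard-code a date.

```python
# Filtering on the partition column prunes the read to a single snapshot directory.
snapshots = spark.read.parquet("/data/snapshots")
latest_date = snapshots.agg(F.max("snapshot_date")).collect()[0][0]
latest = snapshots.filter(F.col("snapshot_date") == latest_date)
latest.createOrReplaceTempView("latest_snapshot")  # hypothetical name
```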
