Could partitioning by date be a feasible way to track the time variance of data in a Big Data environment? My hope is to achieve something similar to the concept of slowly changing dimensions in an RDBMS. Let's assume the following scenario to keep it simple:
Scenario
I have a Hadoop cluster with our data residing in HDFS, currently as .csv files. I also want to use Apache Impala as the query engine. We have some customer data like this:
Nr., Gender, Title, firstname, name, birthday, street, plz, city, phone
1, Frau, Dr., Jenny, Hutch, 23.03.1924, Abcstr. 79, 97230, Duggenfeld, 093 / 38700
Every day, new data will arrive via .csv (let's just say that every day a new, complete version of the customer data is delivered). The new data has to be integrated into our storage system.
The Plan
The idea was that I could enrich the customer data with a timestamp of its delivery:
Nr., Gender, Title, firstname, name, birthday, street, plz, city, phone, deliverydate
1, Frau, Dr., Jenny, Hutch, 23.03.1924, Abcstr. 79, 97230, Duggenfeld, 093 / 38700, 20170814
Then, when creating the corresponding Impala table, I would just use the delivery timestamp to partition the table.
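To make this concrete, here is a minimal sketch of what I have in mind (table name, column types, and HDFS paths are just assumptions for illustration):

```sql
-- Sketch only: table name, column types, and paths are assumptions.
CREATE EXTERNAL TABLE customers (
  nr INT,
  gender STRING,
  title STRING,
  firstname STRING,
  name STRING,
  birthday STRING,
  street STRING,
  plz STRING,
  city STRING,
  phone STRING
)
PARTITIONED BY (deliverydate INT)  -- e.g. 20170814
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/customers';

-- After a day's .csv files land in HDFS, register them as a new partition:
ALTER TABLE customers ADD PARTITION (deliverydate = 20170814)
LOCATION '/data/customers/deliverydate=20170814';
```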
In theory, this would give us a full daily snapshot of the data that we could query against in the future.
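Querying one snapshot would then just filter on the partition column, so Impala should only scan that day's files (again assuming the table sketched above):

```sql
-- Partition pruning: only the files of the 20170814 partition are read.
SELECT * FROM customers WHERE deliverydate = 20170814;
```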
This would not create the same table structure as in SCD2, where each row has a validity timespan, but by querying across different days we could see when, for example, a name changed.
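For example, to see whose name changed between two deliveries, I imagine something like this against the sketched table:

```sql
-- Compare two daily snapshots and return rows where the name differs.
SELECT old.nr, old.name AS old_name, new.name AS new_name
FROM customers old
JOIN customers new ON old.nr = new.nr
WHERE old.deliverydate = 20170813
  AND new.deliverydate = 20170814
  AND old.name <> new.name;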
Do you think this is a good use of partitioning, or is there already a flaw in my reasoning that I can't see at the moment?
There might also be future deliveries where the daily data is just a delta of changed/new values. This could be handled by joining the new data against the previous day's data to spot changes and new entries.
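A rough sketch of how that merge could look, assuming the delta is first loaded into a hypothetical staging table customers_delta with the same columns:

```sql
-- Build today's full snapshot from yesterday's partition plus the delta:
-- rows present in the delta win, everything else is carried forward.
INSERT INTO customers PARTITION (deliverydate = 20170815)
SELECT nr, gender, title, firstname, name, birthday, street, plz, city, phone
FROM customers_delta
UNION ALL
SELECT y.nr, y.gender, y.title, y.firstname, y.name, y.birthday,
       y.street, y.plz, y.city, y.phone
FROM customers y
LEFT JOIN customers_delta d ON y.nr = d.nr
WHERE y.deliverydate = 20170814
  AND d.nr IS NULL;
```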
I already read some interesting posts on here:
- What is the difference between partitioning and bucketing a table in Hive?
- Slowly changing dimensions- SCD1 and SCD2 implementation in Hive
- How does Impala supports Partitioning?
I also watched the EDW 101 for Hadoop Professionals webinar by Cloudera, but they do not mention partitioning as a way to deal with time variance.
I have very little practical experience with Hadoop and Impala, so I appreciate every answer.