
Could partitioning by a date be a feasible way to track time variance of data in a Big Data environment? My hope is to achieve something similar to the concept of slowly changing dimensions in RDBMS. Let's assume the following scenario to keep it simple:


Scenario

I have a Hadoop cluster with our data residing in HDFS, currently in .csv files. I also want to use Apache Impala as the query engine. We have some customer data like this:

Nr., Gender, Title, firstname, name, birthday, street, plz, city, phone
1, Frau, Dr., Jenny, Hutch, 23.03.1924, Abcstr. 79, 97230, Duggenfeld, 093 / 38700

Every day, new data will arrive as a .csv file (let's just say that every day a new, complete version of the customer data is delivered). The new data has to be integrated into our storage system.


The Plan

The idea was that I could enrich the customer data with a timestamp of its delivery:

Nr., Gender, Title, firstname, name, birthday, street, plz, city, phone, deliverydate
1, Frau, Dr., Jenny, Hutch, 23.03.1924, Abcstr. 79, 97230, Duggenfeld, 093 / 38700, 20170814
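
To make the enrichment step concrete, here is a minimal sketch in plain Python, assuming the delivery arrives as comma-separated text with a header row. The function name add_delivery_date is made up for illustration, and the delivery date is passed in by the caller rather than derived from the file.

```python
import csv
import io


def add_delivery_date(csv_text, delivery_date):
    """Append a deliverydate column to every row of a delivered CSV (hypothetical helper)."""
    # skipinitialspace drops the blank that follows each comma in the delivery
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + ["deliverydate"])
    for row in reader:
        if row:  # ignore blank lines
            writer.writerow(row + [delivery_date])
    return out.getvalue()


sample = (
    "Nr., Gender, Title, firstname, name, birthday, street, plz, city, phone\n"
    "1, Frau, Dr., Jenny, Hutch, 23.03.1924, Abcstr. 79, 97230, Duggenfeld, 093 / 38700\n"
)
enriched = add_delivery_date(sample, "20170814")
print(enriched)
```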

Then, when creating the corresponding Impala table, I would just use the delivery timestamp as the partition column. In theory, this would give us a full daily snapshot of the data that we could query against in the future.
This would not create the same table structure as in SCD2, where each row has a validity timespan, but by querying over different days we could see when, for example, a name changed.
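
What I mean by "querying over different days" can be sketched in plain Python (not Impala SQL). This is only an illustration under assumptions: the dict keyed by delivery date stands in for the daily partitions, the function name find_changes is hypothetical, and the name change to "Miller" is invented test data.

```python
def find_changes(snapshots, column):
    """List (nr, deliverydate, old_value, new_value) for each change of `column`.

    snapshots maps a delivery date (YYYYMMDD string) to {customer nr: row dict}.
    """
    changes = []
    dates = sorted(snapshots)  # YYYYMMDD strings sort chronologically
    # Compare each snapshot with the one delivered directly before it
    for prev_date, cur_date in zip(dates, dates[1:]):
        prev, cur = snapshots[prev_date], snapshots[cur_date]
        for nr, row in cur.items():
            if nr in prev and prev[nr][column] != row[column]:
                changes.append((nr, cur_date, prev[nr][column], row[column]))
    return changes


# One dict entry per daily delivery; the last one contains a made-up name change
snapshots = {
    "20170814": {1: {"firstname": "Jenny", "name": "Hutch"}},
    "20170815": {1: {"firstname": "Jenny", "name": "Hutch"}},
    "20170816": {1: {"firstname": "Jenny", "name": "Miller"}},
}
name_changes = find_changes(snapshots, "name")
print(name_changes)  # [(1, '20170816', 'Hutch', 'Miller')]
```

The result gives the first delivery in which the new value appears, which is the closest this layout gets to the start of a validity timespan in SCD2.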

Do you think this is a good use of partitioning, or is there a flaw in my thinking that I can't see at the moment?
There might also be deliveries in the future where the daily data is just a delta of changed/new values. This could be handled by joining the new data against the data from the previous day to spot changes and new entries.
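
The delta handling I have in mind would look roughly like this, again as a plain Python sketch rather than an Impala query. It assumes rows are keyed by the customer number (Nr.); the function name classify_delta and the second customer are invented for illustration.

```python
def classify_delta(previous, delivered):
    """Split a delivery into new entries and changed entries.

    Both arguments map customer nr -> row dict; `previous` is the last day's data.
    """
    new_entries = {nr: row for nr, row in delivered.items() if nr not in previous}
    changed = {
        nr: row
        for nr, row in delivered.items()
        if nr in previous and previous[nr] != row
    }
    return new_entries, changed


previous_day = {1: {"name": "Hutch", "city": "Duggenfeld"}}
delivery = {
    1: {"name": "Hutch", "city": "Berlin"},   # existing customer, made-up city change
    2: {"name": "Smith", "city": "Hamburg"},  # made-up new customer
}
new_entries, changed = classify_delta(previous_day, delivery)
print(sorted(new_entries), sorted(changed))  # [2] [1]
```

Note that this comparison cannot detect deleted customers: a row missing from a delta may simply be unchanged.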

I already read some interesting posts on here:

I also watched the "EDW 101 for Hadoop Professionals" webinar by Cloudera, but they do not mention partitioning as a way to deal with time variance.
I have very little practical experience with Hadoop and Impala, so I appreciate every answer.

  • Voted to close, SO is not the right place for this. Being that said - **(1)** It seems that you don't understand how Hive partitions are defined **(2)** Your logic for the "last day" is false. – David דודו Markovitz Aug 31 '17 at 16:50
  • -SO is not the right place for this. How so? I mean, I get it, I'm new and it's not a question about some source code snippet, but could you elaborate? - (2) Your logic for the "last day" is false - Again, I am here to learn, but just saying "it's false" does not really add anything – BourbonKid Aug 31 '17 at 18:51
  • "This site is all about getting answers. It's not a discussion forum. There's no chit-chat" https://stackoverflow.com/tour – David דודו Markovitz Aug 31 '17 at 19:59
  • Okay, if this came off as chit-chat and gets closed, I will try to revise it. – BourbonKid Sep 01 '17 at 10:39
