We are thinking of migrating our existing Oracle data mart and its corresponding ETL jobs to the Apache platform. Among the various challenges we see, one that stands out is maintaining fact keys / surrogate keys (i.e. dimension table keys) for everyday loads. Specifically, how can we generate unique keys while keeping the data partitioned? Does anyone have experience implementing a whole data warehouse primarily with Hive and Pig? Ideally, we would not want to use any other ETL tools such as Talend.
-
How are you planning to load the dimensions? That's where the SK values are coming from. – Marek Grzenkowicz Oct 16 '15 at 18:31
-
Marek, while that's true, isn't it the same thing to say fact table keys, as facts don't have separate keys, at least not in the implementations that I saw? Anyhow, I didn't think that would confuse people, but I am adding that to the question too, just to make it clear. – vivek ashodha Oct 16 '15 at 19:13
-
I usually use them in RDBMS implementations as well. They're convenient, but - technically speaking - redundant; a composite key made of SKs is enough. If you actually need them, consider GUIDs (Pig can call an appropriate Java function via reflection) or implementing a sequence generator with ZooKeeper. – Marek Grzenkowicz Oct 16 '15 at 20:20
-
Marek, thanks for the pointers. I like the ZooKeeper idea; I don't want any sync issues with Pig. – vivek ashodha Oct 19 '15 at 16:11
-
Alternatively, if GUIDs are not applicable, you can use Cassandra for generating sequences, as described here: http://stackoverflow.com/questions/3935915/how-to-create-auto-increment-ids-in-cassandra – leftjoin Jan 04 '16 at 14:39
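
To make the two suggestions from the comments concrete, here is a rough sketch in Python (not Pig or Hive code, and not anything the commenters posted). It shows a GUID key, and a sequence generator that hands each partition its own block of integers so parallel tasks never collide. The names `new_guid_key`, `BlockSequence`, and `assign_keys` are hypothetical, and the in-memory counter with a lock is only a stand-in for a shared counter kept in a coordination service such as ZooKeeper; block reservation is one common way to build such a generator, assumed here for illustration.

```python
import threading
import uuid


def new_guid_key() -> str:
    """Return a random GUID (UUID version 4) as a 36-character string."""
    return str(uuid.uuid4())


class BlockSequence:
    """Hands out integer keys in blocks reserved from a shared counter.

    Assumption: the in-memory counter and lock stand in for a counter
    held in a coordination service; nothing here talks to a cluster.
    """

    def __init__(self, start: int = 1, block_size: int = 1000):
        self._next_block_start = start
        self._block_size = block_size
        self._lock = threading.Lock()

    def reserve_block(self) -> range:
        """Atomically reserve the next block of keys for one task."""
        with self._lock:
            first = self._next_block_start
            self._next_block_start += self._block_size
        return range(first, first + self._block_size)


def assign_keys(rows, block):
    """Pair each row of one partition with a key from its reserved block."""
    if len(rows) > len(block):
        raise ValueError("partition has more rows than the reserved block")
    return list(zip(block, rows))


if __name__ == "__main__":
    seq = BlockSequence(start=1, block_size=3)
    # Each partition reserves its own block, so keys never overlap.
    print(assign_keys(["row_a", "row_b"], seq.reserve_block()))
    print(assign_keys(["row_c"], seq.reserve_block()))
    print(new_guid_key())
```

With block reservation, keys stay unique across partitions but the sequence has gaps, because the unused remainder of each block is thrown away. GUIDs need no coordination at all, at the cost of wider, non-sequential keys.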