
I am struggling with something in Pentaho and I am not entirely sure if Pentaho will be able to handle this problem. I will try to explain as best I can.

So I have a column in my fact sales table called reference number, which I must use to look up an ID in the dimension table and return that ID. But if the first dimension column I do my lookup against returns a null, I need to check the same fact table field against another column in the dimension, and then against another column again.

Is there a way in Pentaho to ask it to go through a process of 3 different lookups and, if a match exists in one of those 3 columns, return the ID into the same column in fact sales?

I'm using MySQL as my database.


2 Answers


This seems to be a fairly basic task for Pentaho Data Integration.

You could do this by chaining three Database lookup steps (or Dimension lookup/update, depending on the type of your dimension), each storing its lookup result in a different field.

Then use a Modified Java Script Value step to perform null coalescing, i.e. choose the first non-null value, and finally, if you need it, a Select values step to remove the three lookup-result columns that are no longer needed.
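For the coalescing itself, here is a minimal sketch of what could go into the Modified Java Script Value step, assuming the three lookups wrote their results into fields named id_1, id_2 and id_3 (hypothetical field names):

    // Pick the first non-null lookup result; stays null if none of the three matched.
    var dim_id = id_1 != null ? id_1 : (id_2 != null ? id_2 : id_3);

dim_id then just needs to be declared as an output field of the step.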

Below is a screen with a simplified case, but I'm sure you can follow the logic behind it and implement it in your scenario, since I've mentioned the steps you could use to achieve the task:

[screenshot of the simplified example transformation]

Kamil Gosciminski
  • Hi, thank you for your help. This is what I am currently doing, but it takes far too long and I thought there must be a way where in the 2nd lookup we only look through "not matched yet" records – Arnoux Olivier Sep 05 '16 at 14:14
  • What do you mean by far too long? To implement? To execute? Anyway, it seems odd to have a "dimension TK" in a fact table that points to different columns in a dimension. You could always execute an SQL statement to perform only one lookup. – Kamil Gosciminski Sep 05 '16 at 14:40
  • Yeah, I know ... that's what happens when working with very shitty data. The column in fact sales could match one of 3 columns in my dimension, according to what the rules were back then. So what I meant by "run" is the ETL that is supposed to add the IDs from the dimension, where matched, onto the fact sales table – Arnoux Olivier Sep 05 '16 at 14:46
  • There is an NVL function in the Calculator step – simar Sep 05 '16 at 15:15
  • @simar there are many approaches but this does not seem to be the main issue here. Performance of lookups is :) – Kamil Gosciminski Sep 05 '16 at 15:20
  • The Database join step is much better in terms of performance; at least there is no need to establish an extra connection to the database. – simar Sep 05 '16 at 15:34
  • @ArnouxOlivier like simar and I suggested, you can implement your logic (if you have SCD type 2 or something mixed with versioning) in SQL using the Db Join step. – Kamil Gosciminski Sep 05 '16 at 15:58
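
As an illustration of the single-lookup approach suggested in the comments above, a Database join step could run a query along these lines once per row, with the fact's reference number mapped to all three parameters (table and column names here are hypothetical):

    SELECT COALESCE(
             (SELECT d.id FROM dim_reference d WHERE d.ref_current = ? LIMIT 1),
             (SELECT d.id FROM dim_reference d WHERE d.ref_old     = ? LIMIT 1),
             (SELECT d.id FROM dim_reference d WHERE d.ref_oldest  = ? LIMIT 1)
           ) AS dim_id

Indexes on the three reference columns matter here; without them every fact row triggers three scans of the dimension in MySQL.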

It would be far faster to use a Filter rows step. If the first looked-up value is null, filter on that null to route the null stream to a second lookup and the found rows to your "found" step. Rinse and repeat until you have what you want.

Then use a multiway Merge Join to stitch your dataset back together. To be honest, the Merge Join step might not even be necessary if the resulting streams are all identical, which you can achieve with some Select values steps if they're not. There is no need to look everything up for every row; doing all the lookups up front and evaluating afterwards really doesn't take advantage of the parallel processing at all.

Does that help?

Codex