I am very new to Qubole.We recently migrated Oracle ebiz data to Saleforce.We have both Ebiz and Salesforce data in the Qubole Data Lake.There are some discrepancies between Ebiz and Salesforce.What is the technology I can use on Qubole to find these discrepancies?
-
Hive, Pig, or Spark would be a start, but that is a generic answer for HDFS/S3 data using any Hadoop-related stack – OneCricketeer Dec 07 '21 at 00:40
1 Answers
This is approach I am using to compare two tables. Aggregate all metrics in two tables group by all dimensions, then compare using FULL JOIN, it will return all joined and not joined records from both tables. In such way you can found absent data in both tables and differences in metrics.
For example like this, using Hive:
with
sf as (
select dimension1, dimension2, ... dimensionN,
sum(metric1) as metric1,
sum(metric2) as metric2,
...
sum(metricN) as metricN,
count(*) as cnt
from Salesforce_table
group by dimension1, dimension2, ... dimensionN
),
eb as (
select dimension1, dimension2, ... dimensionN,
sum(metric1) as metric1,
sum(metric2) as metric2,
...
sum(metricN) as metricN,
count(*) as cnt
from Ebiz_table
group by dimension1, dimension2, ... dimensionN
)
--compare data
select sf.*, eb.*
from sf full join eb on NVL(sf.dimension1,'')=NVL(eb.dimension1)
and sf.dimension2=eb.dimension2
...
and sf.dimension3=eb.dimension3
--Filter discrepancies only
where ( sf.metric1!=eb.metric1
or sf.metric2!=eb.metric2
...
or sf.metricN!=eb.metricN
or sf.cnt!=eb.cnt
or sf.dimension1 is null
or eb.dimension1 is null
)
Also you can easily compare in Excel instead of filtering in the WHERE.
Metrics are everything that can be aggregated. You can use some dimensions as metrics also like this count(distinct user) as user_cnt
and group by date, site_name
for example. Query with full join will show differences. If some dimensions used in join condition can be null, use nvl() to match such rows like in my example. Of course do not use too many dimensions in the groupby, you can skip some of them and drill down only after finding discrepancies on aggregated level.
After you got discrepancy in aggregations, you can drill down and compare rows not aggregated, filtered by some metrics.

- 36,950
- 8
- 57
- 116