Data comparisons in Qubole

Question

I am very new to Qubole.We recently migrated Oracle ebiz data to Saleforce.We have both Ebiz and Salesforce data in the Qubole Data Lake.There are some discrepancies between Ebiz and Salesforce.What is the technology I can use on Qubole to find these discrepancies?

Hive, Pig, or Spark would be a start, but that is a generic answer for HDFS/S3 data using any Hadoop-related stack — OneCricketeer, Dec 07 '21 at 00:40

leftjoin · Answer 1 · 2022-01-13T12:42:49.483

This is approach I am using to compare two tables. Aggregate all metrics in two tables group by all dimensions, then compare using FULL JOIN, it will return all joined and not joined records from both tables. In such way you can found absent data in both tables and differences in metrics.

For example like this, using Hive:

with 
sf as (
select dimension1, dimension2, ... dimensionN,
       sum(metric1) as metric1,
       sum(metric2) as metric2,
       ...
       sum(metricN) as metricN,
       count(*)     as cnt
  from Salesforce_table 
 group by dimension1, dimension2, ... dimensionN
),
eb as (
select dimension1, dimension2, ... dimensionN,
       sum(metric1) as metric1,
       sum(metric2) as metric2,
       ...
       sum(metricN) as metricN,
       count(*)     as cnt
  from Ebiz_table 
group by dimension1, dimension2, ... dimensionN
)

--compare data
select sf.*, eb.*
  from sf full join eb on NVL(sf.dimension1,'')=NVL(eb.dimension1)
                      and sf.dimension2=eb.dimension2
                      ...
                      and sf.dimension3=eb.dimension3
--Filter discrepancies only
 where (   sf.metric1!=eb.metric1 
        or sf.metric2!=eb.metric2
        ...
        or sf.metricN!=eb.metricN
        or sf.cnt!=eb.cnt
        or sf.dimension1 is null
        or eb.dimension1 is null
        )

Also you can easily compare in Excel instead of filtering in the WHERE.

Metrics are everything that can be aggregated. You can use some dimensions as metrics also like this count(distinct user) as user_cnt and group by date, site_name for example. Query with full join will show differences. If some dimensions used in join condition can be null, use nvl() to match such rows like in my example. Of course do not use too many dimensions in the groupby, you can skip some of them and drill down only after finding discrepancies on aggregated level.

After you got discrepancy in aggregations, you can drill down and compare rows not aggregated, filtered by some metrics.

See also: https://stackoverflow.com/a/67382947/2700344

Data comparisons in Qubole

1 Answers1