1

1st Table : AggregateValuesSum - SumProfit and NbCustomers are calculated group by all others characteristics columns.

Date        MonthStatus Gender   Age   Income   State   SumProfit  NbCustomers
2011-01-01  Status1     F        50    12000    CA       450000    3
2011-01-01  Status2     F        50    12000    CA       450000    2
2011-01-01  Status1     F        50    12000    CA       46000     2
2012-01-01  Status1     H        25    15000    NF       -500      1
2012-01-01  Status2     H        45    20000    QC       12000     5

2nd Table : AggregatesBudget - Same table with AggregatesValueSum without MonthStatus column. SumBudget is calculated group by all others characteristics (For others analysis):

Date         Gender   Age   Income   State      SumBudget
2011-01-01   F        50    12000    CA         30000
2012-01-01   H        25    15000    NF         10000
2012-01-01   H        45    20000    QC         5000

Final table : ConsolidationTable - Left join of 2 first tables on commons rows ( Gender, Age, Income, State). So i have all characteristics and all my calculated values.

Date        MonthStatus Gender   Age   Income   State   SumProfit  NbCustomers  SumBudget
2011-01-01  Status1     F        50    12000    CA       450000    3            30000
2011-01-01  Status2     F        50    12000    CA       450000    2            30000
2011-01-01  Status1     F        50    12000    CA       46000     2            30000
2012-01-01  Status1     H        25    15000    NF       -500      1            -20000
2012-01-01  Status2     H        45    20000    QC       12000     5            5000

As you can see, SumBudget is duplicated in this final table for rows that have same characteristics without considering the Sumprofit and MonthStatus. (Which is normal)

Everything seems fine for now. Then i validate the total of SumBudget per Date and compare the result given in 2nd tables and Final table :

Select
 Date,
 Sum(SumBudget) as TotalBudgetPerDate
from AggregatesBudget
 group by Date

For some dates it give me a different result when i do this :

Select
  Date,
  Sum( distinct SumBudget)  as TotalBudgetPerDate
from ConsolidationTable
group by Date

THE PROBLEMS :

  • The TotalBudgetPerDate in the AggregatesBudget Table is always what i expect so it's OK.
  • I have to add a distinct in my 2nd query because in consolidation table there is duplicated SumBudget

  • For most of Date I have same result with AggregatesBudget Table. But sometimes the result in the ConsolidationTable is not right for some dates.

I don't understand why!

NB: Here was fake data in reality I have 49Md rows. And I won't share real data because of sensitivity.

Nick.Mc
  • 18,304
  • 6
  • 61
  • 91
Sara
  • 17
  • 5
  • 1
    create the `DDL` and expected result that would be alot easier – RoMEoMusTDiE Jan 09 '20 at 23:19
  • What do you mean by a DDL ? – Sara Jan 09 '20 at 23:21
  • 1
    DDL = data definition language, = a script that can be run to create your table + a script that inserts some sample values. But in summary, the `float` datatype is not suitable for dollar figures. float is imprecise. – Nick.Mc Jan 09 '20 at 23:24
  • Oh ok a validation script. But why doing this if i already know which dates gave me wrong result ? (I have a a validation script). If float is imprecise for dollar what could i use ? :( – Sara Jan 09 '20 at 23:25
  • Most people use `MONEY`. You can also use `DECIMAL(19,2)`. Definitely do not use `REAL` or `FLOAT` – Nick.Mc Jan 09 '20 at 23:40
  • I.ve changed it for a decimal. Tried a money to. it doesn't change. For a month there 10 000$ difference. I think the problem is in my Consolidation table. There is duplicated Budget for each similar caracterics(without considering monthstatus and profit) So when i do a distinct i exclude rows that aren't concerned for this problem. I don't know how to solve it – Sara Jan 09 '20 at 23:40
  • OK now that `FLOAT` is out of the picture, can you confirm what type of database it is? SQl Server? Oracle? – Nick.Mc Jan 09 '20 at 23:41
  • BTW converting FLOAT to DECIMAL in a 49 million row table is non trivial. Also I forgot `FLOAT` has different properties in Oracle and SQL Server so you really need to confirm what kind of database this is. – Nick.Mc Jan 09 '20 at 23:49
  • In summary, `distinct` is not the answer because it will remove bidget figures that are for legtimiately different characterstics, but just happen to be the same number. The answer is to roll up actuals and budgets by date first, then combine them. – Nick.Mc Jan 09 '20 at 23:56
  • This seems likely to be a common error where people want some joins, each possibly involving a different key, of some subqueries, each possibly involving join and/or aggregation, but they erroneously try to do all the joining then all the aggregating or to aggregate over previous aggregations. PS Show the DDL, queries & table values from which you got these table values. This is always best/requred in debugging questions. [mre] [ask] But also your descriptions aren't clear & detailed. – philipxy Jan 10 '20 at 00:32
  • Does this answer your question? [Strange duplicate behavior from GROUP\_CONCAT of two LEFT JOINs of GROUP\_BYs](https://stackoverflow.com/questions/45250646/strange-duplicate-behavior-from-group-concat-of-two-left-joins-of-group-bys) – philipxy Jan 10 '20 at 00:33
  • It's on SQL server 2017 – Sara Jan 10 '20 at 14:45

2 Answers2

0

distinct SumBudget will remove duplicate values.

Sum(distinct SumBudget) Will only add the unique values, causing data loss for some dates.

Shep
  • 638
  • 3
  • 15
0

This will occur if you have duplicate values in the aggregatesbudget table on a single date (assuming you have no duplicate keys overall). You can check this using:

select date, sumbudget, count(*)
from aggregatesbudget
group by date, sumbudget
having count(*) > 1;

You could fix this using window functions:

select date, sum(sumbudget)
from (select ct.*,
             row_number() over (partition by Gender, Age, Income, State, date order by date) as seqnum
      from ConsolidationTable ct
     ) ct
where seqnum = 1
group by date;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786