2

What is the best way to handle missing values in a dimension table?

In the case of a textual column, it is easy to write "NA: Missing," but what should be done for numeric columns where it is important to retain the specific values. Note: I do not want a solution that uses banded values (e.g., textual columns for "0-50", "50-100", "NA: Missing").

For instance, a customer dimension may have a year-of-birth. How should missing years of birth be handled? Leave it null? Add in an arbitrary number as a placeholder such as 1900?

Sometimes, it may be difficult to find a placeholder number. For instance, if sales-to-date are non-negative, but can be zero I wouldn't want to put "0" as a placeholder for null. I could use negative values such as "-1", but that would ruin queries that use sums.

Damir Sudarevic
  • 21,891
  • 3
  • 47
  • 71
d_a_c321
  • 533
  • 1
  • 11
  • 23
  • 2
    Aren't you answering your question yourself? What's so bad about NULL? – fancyPants Apr 16 '13 at 12:02
  • sales-to-date cannot be null. In such case error has to be raised during ETL process. We have null for customers with missing birthdate in our DW. – Tomas Greif Apr 16 '13 at 12:13
  • @tombom, aren't nulls strongly discouraged in dimension tables? I wanted to clarify whether they should be permitted in the case of a numeric value.@twn08, sales-to-date could be null if sales data were missing. You're right that birthdate example is better. What is your logic for keeping birthdate null? Also -- do you use a flag to indicate that the value is null? For instance, birthday_is_null and so on for each null column? Do you know of any references that discuss this issue or explain why placeholders are an unequivocally bad idea? – d_a_c321 Apr 16 '13 at 13:38
  • 2
    There are no laws about this: you decide what's permitted in your own database and what isn't. `NULL`s are a controversial topic for some reason, but they're just a tool like any other. [This question](http://dba.stackexchange.com/questions/5222/why-shouldnt-we-allow-nulls) on the DBA site is a good one to review. Our DWH uses NULLs wherever a default value would distort report data (this is typically an issue on fact tables). The only downside is that your reporting tool has to handle them properly if you use them as report criteria, which may be the main consideration for dimensions. – Pondlife Apr 16 '13 at 17:09
  • 3
    I agree with @Pondlife - default values sometimes do more harm than good. Customer's a particuarly thorny one because there are often a lot of customer attributes that we *don't know* for any number of reasons. I prefer to leave columns NULL if they are truly unknown. – N West Apr 16 '13 at 17:23
  • Thanks so much for your replies, @NWest and pondlife.. can I also get your thoughts on whether it makes sense to create an "var_is_null" flag and how you perform queries when there are null values? – d_a_c321 Apr 22 '13 at 20:33

1 Answers1

3

In your fact table you never use a null value for a foreign key, but you can and should use null values for the metrics where appropriate. A null value will give accurate results when aggregated, where a default value will not.

In dimension tables also the attributes can and should be null where appropriate, for the same reason. While it's less common to do aggregation of the dimension values, it does happen, so it should be right when it happens.

If you have a need for an empty value in a dimension then the dimension should have a row for the purpose. The Date Dimension for instance might have 3 or 4 special rows - no value, unknown, past and future are reasonable special value rows, depending on your needs.

You will save yourself a lot of pain and suffering in the BI layer this way.

Corey
  • 15,524
  • 2
  • 35
  • 68
  • Any thoughts on how to handle multiple reasons for null values of numeric facts? I'm not sure I see an easier way than to create a separate column null_status = (PRESENT, 'Null: Reason 1', "Null: Reason 2', etc.). – d_a_c321 Sep 23 '13 at 21:02