1

I am creating a DW for an OLTP that is creaking somewhat.

A problem I'm faced with is that there isn't much data integrity in the OLTP database. An example would be a Suburb field.

This suburb field is a free text field on the OLTP UI which means we've got values in the field, plus we've got empty strings and we've got NULL values.

How would we usually handle this? The scenarios I've come up with are:

  1. Import data as is (not ideal)
  2. In my ETL process, treat any empty string the same as a NULL and replace that with the word 'Unknown' in the DW
  3. Import both empty strings and NULL's as empty strings in the DW

Just FYI, I'm using the Microsoft BI stack (SQL Server, SSIS, SSAS, SSRS)

Paul
  • 3,072
  • 6
  • 37
  • 58
  • 1
    Why is NULL such a no-no in a DW? I don't understand the desire to follow DW concepts from the 70s - if a birth date is unknown, it's not 1900-01-01. – Aaron Bertrand Apr 17 '13 at 22:37
  • I would probably agree with you on that. This question is more about handling empty strings V NULL's in an OLTP – Paul Apr 17 '13 at 23:05
  • 1
    The question is very much like you're worried about what to do with this data once you move it to the DW. If you're trying to correct the source OLTP system, you should make that more clear in the question (and the title). – Aaron Bertrand Apr 18 '13 at 00:03
  • That is what I'm worried about. The question is about handling empty strings and NULL values in the OLTP database, and what the opinion is on how they should be transformed (if at all) before inserting in the DW. – Paul Apr 18 '13 at 01:37
  • Well if null and empty string mean the same thing, pick one and don't allow the other. And be consistent in both systems so your semantics are always the same. – Aaron Bertrand Apr 18 '13 at 01:52

2 Answers2

4

The short answer is, it depends on what NULL and empty strings mean in the source system.

This general question (handling NULL) has been discussed a lot, e.g. here, here, here etc. I think the most important point to remember is that a data warehouse is just a database; it may have a very specific type of schema and be designed for one purpose, but it's still just a database and any general advice on NULL still applies.

(As a side note, I sometimes prefer to talk about a "reporting database" rather than a "data warehouse", because it keeps things in perspective. Some DBAs and developers start making plans for huge server farms and multi-year ETL projects as soon as they hear the words "data warehouse", but in the end it's just a reporting database.)

Anyway, it isn't completely clear where you want to use NULL but it looks like it may be an attribute on a dimension.

I (probably) wouldn't use any of your three approaches, but it depends on the meaning of your data. Importing the data as-is is not useful because part of the value of a data warehouse is that the data has been cleaned and is consistent, which makes querying and comparing data along other dimensions much easier.

Replacing empty strings with 'Unknown' may or may not be correct: what does an empty string mean in the source system? There's a big difference between "it means there's no suburb" and "it means we don't know if there's a suburb". Assuming that an empty string means "no suburb" and NULL means "unknown" then I would import the empty strings as they are, but replace NULL with 'Unknown'. The main reason for doing that is that if the Suburb field will be used as a filter condition in a report, it's easier for users (and possibly your reporting tool) to work with a non-NULL value like 'UNKNOWN'. And if there is no consistency in the source system and you don't know what empty strings and NULLs mean, then you need to clarify that first and ideally fix the source system too (another benefit of a DWH is that it helps to identify inconsistencies and data handling errors in source systems).

Your last idea to convert NULLs to empty strings is the same issue: what does a NULL actually mean in the source system? If it means "no suburb" then replacing it with an empty string is probably a good idea, but if it means something else then you should handle it as something else.

So to summarize, my preference would be to import empty strings as-is, and convert NULL to 'UNKNOWN', but I can't be sure that this actually makes sense in your case. There's no single answer to this question because it all depends on your specific data and what it means. But there's no problem with using NULL in a data warehouse (or any other database) as long as you do it consistently and with a clear understanding of how the source systems handle data.

Community
  • 1
  • 1
Pondlife
  • 15,992
  • 6
  • 37
  • 51
  • 1
    It gets worse if there's more than one possible meaning for an empty string - what if it means that either "there is no suburb for this record, but it is applicable to this row; we don't know if there's a suburb; a suburb exists but hasn't been loaded yet; suburb is not applicable to this row" – N West Apr 18 '13 at 18:26
  • @NWest Yes, I absolutely agree, and that's exactly why you need to understand the source system very well. The worst culprits in my experience are user-definable custom fields, or third-party applications that are impossible to modify so users (ab)use existing but unused fields for storing all sorts of odd data that was never meant to be there. It's even worse when two users use the same field to mean different things, which is not uncommon in systems like CRM: each user sees only their own data anyway, so they tend to consider the data and even the GUI to be 'theirs'. – Pondlife Apr 18 '13 at 18:40
1

Semantically, NULL would usually mean undefined/unknown. Whereas, "" empty string would mean that the value is known to be empty. In your suburb example, NULL could mean that it is not known whether there is a suburb for the given record, while "" could mean there is for sure no suburb for the given record.

If the meaning of NULL and "" are identical in your situation, it is best to normalize both values to same thing (say "") before importing to DW to make it easier to do your reports later (so as not to have NULL = 50 and "" = 34 and having to add them together).

Arsen
  • 965
  • 8
  • 7