In particular, I am dealing with a Type 2 Slowly Changing Dimension and need to represent the time interval for which a particular record was active, i.e. each record has a StartDate and an EndDate. My question is whether to represent this with a closed interval ([StartDate, EndDate]) or a half-open interval ([StartDate, EndDate)), i.e. whether or not the interval includes the EndDate itself. To take a concrete example, say record 1 was active from day 1 to day 5, and from day 6 onwards record 2 became active. Do I make the EndDate for record 1 equal to 5 (closed) or 6 (half-open)?
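To make the two options concrete, here is a minimal Python sketch of the "is this record active on day d?" test under each convention. The calendar dates and the function names are hypothetical, chosen only to stand in for "day 1", "day 5" and "day 6" from the example.

```python
from datetime import date

def day(n):
    # Hypothetical mapping: "day n" of the example becomes n January 2024.
    return date(2024, 1, n)

def active_closed(start, end, d):
    # Closed interval [start, end]: the EndDate itself is still covered.
    return start <= d <= end

def active_half_open(start, end, d):
    # Half-open interval [start, end): the EndDate is the first day NOT covered.
    return start <= d < end

# Record 1 under each convention.
closed_record = (day(1), day(5))
half_open_record = (day(1), day(6))

# Both conventions agree that record 1 is active on day 5 and not on day 6.
print(active_closed(*closed_record, day(5)), active_closed(*closed_record, day(6)))
print(active_half_open(*half_open_record, day(5)), active_half_open(*half_open_record, day(6)))
```

Both conventions describe the same set of days; they differ only in which value is stored in EndDate and which comparison operator the query uses.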
Recently I have come around to the view that half-open intervals are best, based on, inter alia, Dijkstra's "Why numbering should start at zero" as well as the conventions for array slicing and the range() function in Python. Applied to the data warehousing context, I see the advantages of a half-open interval convention as the following:
- Duration: EndDate - StartDate gives the length of time the record was active, with no off-by-one adjustment.
- Validation: The StartDate of the next record equals the EndDate of the previous record, which is easy to validate.
- Future Proofing: If I later decide to change my granularity from daily to something finer, the switchover point stays precise. If I use a closed interval and store the EndDate with a timestamp of midnight, the rest of that final day would fall outside the interval, so I would have to adjust these records to accommodate the change.
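The three points above can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout, the helper names and the use of date.max as an "still active" sentinel are hypothetical and not taken from any particular warehouse design.

```python
from datetime import date, datetime, time

# Hypothetical half-open records: record 1 covers 1-5 Jan, record 2 starts on 6 Jan.
records = [
    {"key": 1, "start": date(2024, 1, 1), "end": date(2024, 1, 6)},
    {"key": 2, "start": date(2024, 1, 6), "end": date.max},  # assumed open-ended sentinel
]

def days_active(rec):
    # Duration is a plain subtraction, no "+ 1" needed.
    return (rec["end"] - rec["start"]).days

def is_contiguous(recs):
    # Validation: each record must start exactly where the previous one ends.
    ordered = sorted(recs, key=lambda r: r["start"])
    return all(prev["end"] == nxt["start"] for prev, nxt in zip(ordered, ordered[1:]))

def to_timestamp(d):
    # Finer granularity: a date becomes midnight at the start of that day.
    return datetime.combine(d, time.min)

print(days_active(records[0]))   # 5
print(is_contiguous(records))    # True

# After converting to timestamps, an event on the afternoon of 5 Jan:
event = datetime(2024, 1, 5, 14, 0)
start_ts = to_timestamp(records[0]["start"])
half_open_end = to_timestamp(records[0]["end"])   # 6 Jan 00:00, exclusive
closed_end = to_timestamp(date(2024, 1, 5))       # 5 Jan 00:00, inclusive

print(start_ts <= event < half_open_end)   # True: still inside record 1
print(start_ts <= event <= closed_end)     # False: closed EndDate would need adjusting
```

The last two lines show the future-proofing point: the half-open EndDate keeps its meaning when the dates are promoted to timestamps, whereas the closed EndDate stored as midnight no longer covers the final day.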
Therefore, my preference would be to use the half-open interval convention. However, if there were a widely adopted industry convention of using closed intervals, I might be swayed to go with that instead, particularly if it is based on practical experience of implementing such systems rather than on my abstract theorising.