3

Background

We are a courier company specialized in same-day-delivery for webshops. Customers can choose for same-day-delivery when ordering at a webshop, the webshop registers those packages in our system and at a certain cut-off time, we pick up those packages and deliver them later in the evening.

Our situation is pretty common I think: over the past few years the company has been collecting a ton of data. This data just sits in the operational database and every once in a while an Excel export is made and send to a freelancer to create a report in Power BI. So our management wants to start working with automated daily reports and scorecards for our drivers and I have been tasked to figure that out. My plan is to build a data warehouse using MS SQL Server, SSMS, SSIS and Power BI (maybe Stitch and Dataform for ETL and modelling to replace SSIS)

The problem

My boss wants to know what the number of several statuses are at the end of the day. But right now, that is impossible because the when the status changes, the entire row is overwritten. A shipment goes through 6 statuses and because we pickup and deliver the shipment within the same day (or actually within 8 hours), the shipment goes through all statuses in a single day. This also means that the ship date is always the same as the delivery date.

For example: If we ship 100 packages on 2019-11-25 and 10 of those are failed to deliver, those 10 shipments will receive the status "failed to deliver" and will be send again the next day, so 2019-11-26. However, after 4am the status of shipments that are failed to deliver are reset to "in depot". This means that when I fill the data warehouse with those 100 packages BEFORE 4am 2019-11-26, I'll have 100 shipments, with 90 delivered and 10 failed to deliver with the date 2019-11-25, which is correct.

Now let's say that on 2019-11-26, the 10 packages that were failed to deliver are now successfully delivered and I run the ETL process before 4am 2019-11-27, the data warehouse will no longer contain the 10 shipments that were failed to deliver on 2019-11-25. Instead it will only say that 90 shipments have been shipped and delivered on 2019-11-25 and 10 delivered on 2019-11-26, instead of 100 shipped on 2019-11-25 with 90 delivered and the 10 failed to deliver. This is because those 10 shipments have been overwritten with the status "delivered".

So my idea is to treat the fact table as SCD 2 to track the status of shipments but also the delivery dates. Every shipment_id represents one shipment and one package(boxes) as we do not send multiple packages in one shipment. This means that a count of the shipment_id's gives me the amount of shipments we have done. With SCD 2 that can still be achieved with a distinct count so the reports will still show the right amount of shipments on a single day.

Another solution (I think) would be to do a count of all the statuses and save those per day in a separate table in the data warehouse. Like a periodic snapshot fact table.

What do you guys think of this approach to track the status of the shipments by applying SCD2 to the fact table to show the amount of certain statuses at the end of every day?

Arwish
  • 31
  • 2
  • status change is the fact data. You can add a new row in the fact table everytime the status changes for a package delivery (deliver time, status). Is this what you meant by SCD 2 fact? From your description, it seems to me it is hard to capture the status changes because the OLTP system does not do that. Loading time will be tricky. You can also add a update trigger to the OLTP table and whenever the status is updated, track it in a separate table. – PeterHe Nov 27 '19 at 14:26
  • Well, if I would run the ETL process at the end of every day, I'll have the status of a package at the end of that day. If the status has changed during the next day, it will capture that new status and write that as a new record at the end of the next day. Regarding you update trigger idea, this sounds good as well. But that means that the separate status table will be a dimension table in the star schema, right? – Arwish Nov 27 '19 at 14:50

0 Answers0