1

Context: I have a table with 20 columns. This table has records that are imported from files and later they have to be processed. Therefore a column called ProcessDateTime exists to mark a record as processed or not.

This table will reach millions of records. Some of the records will never be processed.

I have a few questions:

  • Let's say that 95% of the records will be marked as processed. Should I create a child table to store this ProcessedDateTime?
  • What if only 50% of will be processed? Should it make any difference on the database design?
  • We are talking about one column only. What if they where 5 or 6 columns in the same situation?
  • Is there even a line to be drawn in here?

Thanks

Edit: When I select my data, I want the records that are not processed and that are not older than one month. So, I will have a date field to tell me when they were inserted.

Paulo Lima
  • 184
  • 5
  • Please post your table structure you have thought – Joe Taras Jan 14 '19 at 10:27
  • Also, it depends a little bit on what will happen for any record that is processed. Are they to be kept forever? Will they be updated any time later? – FXD Jan 14 '19 at 10:54
  • Possible duplicate of [How can you represent inheritance in a database?](https://stackoverflow.com/questions/3579079/how-can-you-represent-inheritance-in-a-database) – philipxy Jan 14 '19 at 22:58

1 Answers1

2

I don't recommend separating out the records into a separate table -- that just makes managing the database more complicated.

When you are designing the database, a very important consideration is what queries will be run, rather than mere theoretical considerations. I am guessing that you want to quickly access rows where ProcessDateTime is NULL.

What you are concerned about (in this case) is that the rows you need will be thinly spread over many rows. This in turn means that you have lots of data pages in memory, but most of the records would not be of interest because the value is not NULL.

Databases have two other options for dealing with this: clustered indexes and partitions (although these are not both supported by all databases). Both of these incur extra overhead when you update the NULL value to a non-NULL value -- basically the record needs to "switch" partitions/data page location.

That said, there are some cases where having a table of unprocessed rows makes sense, and these are then "archived" after processing. Although functionally the same, the description of the problem is different. In this description, the focus is on the unprocessed rows; when processed they go to live in the archive where presumably the values will not be changed.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • I agree, while the main table can get smaller the extra join to fetch the data is not worth it (unless that join will be very rarely used, maybe) – jean Jan 14 '19 at 12:28