Design of DB table with many rows and many columns with binary information

Question

We want to create a table in a sql database with ~10^6 rows. Every entry has many binary properties (say about 30, every row only has a few properties set True, and most set False) and a few integral properties (say about 5).

How do I set up such a table?

Specifically, should I just have one column in the table for each property (binary or integral) or should I make columns for the integral properties and new table with the binary properties together with a many-to-many relationship? Or is there another even better/cleaner option ?

I should add that

we will often query rows with a given combination of properties, so we would like those selects to be easy to write, clean, and fast,
we will regularly add binary properties

A typical entry would look like, with integral properties I and binary properties B:

EntryID | I1 | I2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 | ... | Bn
---------------------------------------------------------------
1234567 | 12 | 5  | 2  | F  | F  | F  | F  | T  | F  | ... | F

Herris · Answer 1 · 2016-06-03T08:45:59.373

1

Instead of having too many columns with only true/false value, I would suggest to have integer type column named like 'some_status' to replace some of the properties with same category. Like for example some_status = 10 stands for active, some_status = 20 stands for inactive, some_status = 30 stands for pending etc. It probably will help to reduce some columns.

SUGGESTION 2

As you mentioned you will add binary properties regularly, so I would suggest to design your database like below so you can update Binary_property table any time you want.

And for your situation where only few binary properties will be true, you can consider to only add those binary properties in Entry_Binary_properties table when it is true. Later when you select, if the Binary property is not in Entry_Binary_properties table, it will be false in default.

Hope this will help. =)

edited Jun 03 '16 at 08:45

answered Jun 03 '16 at 07:16

Herris

61
5

Thanks for your answer -- I must confess that I cannot quite believe that this is the right way of setting up the db structure. I also added a sentence about queries depending on the properties. Your suggestion makes that really hard to do. – Christian Jun 03 '16 at 07:30
I understand it may not applied well on your situation, maybe you can update some sample of your codes or binary/integral properties so that it would be easier for us to answer for you. =) – Herris Jun 03 '16 at 07:51
Added a suggestion 2 on above, hope it helps you to figure out some image on your database design. =) – Herris Jun 03 '16 at 08:47

score 1 · Answer 2 · edited May 23 '17 at 12:22

You will know the performance when you test it. It should take at most two hours to jam in test data for both bullet points you have put together with dummy data. To be honest, your first bullet point dummy data will take a lot less time to generate than your second bullet. How you would do that would be thru a varied set of about 5000 rows then repeat them similar to that link above. That way it keeps your indexes honest and close to real life experiences.

The bullet points pros and cons that immediately come to mind are this:

Your first bullet point will benefit greatly from a Covering Index (or several). That means your Read queries will be screaming fast by comparison. You benefit from index page "covered" information without the need to traverse from the index to data pages. Note that your covering index is viable across potentially all of your binary and integral columns as you call them because they are thin.

Depending on your queries, and only you would know, you would need to investigate Composite Indexes a.k.a. Multi-Column Indexes too. Reason being, speed on retrieval.

The difference between Covering and Composite is, though both are on multiple columns, the Covering index would not need a trip to the data page for retrieval of info on reads.

On the con side, any of your regular changes to the schema will need to occur with alter table statements and index regeneration. On a 10^6 row table that is relatively minor. On a 10^9 different story.

Thus ends commentary on bullet one.

Your second bullet point (association / junction / intersect tables) will benefit from a more sane developer's approach when it comes time for changes. But it will suffer in performance when compared to covering or composite index strategies used in your first bullet. I would estimate that the order of retrieval will be magnitudes slower. Just a guess, worth a bet, not hard to test.

In either case, only you will know when you have the right balance of index choices that are never a freebie. With speed on retrieval comes the price of slowness on insert/update.

Thanks, @Drew for your answer, I have to play and digest it and I will do so! — Christian, Jun 07 '16 at 15:29

score 0 · Answer 3 · edited Jul 28 '16 at 10:28

0

I've had a similar issue and went with a cross references (xref) table. it's a table with 2 primary key columns that are both foreign keys to the tables in question.

CREATE TABLE Table1Table2Xref (
  Table1id INT foreign key references table1(Id),
  Table2id INT foreign key references table2(Id),
  info char(200),
  primary key (userid, userdataid),
);

edited Jul 28 '16 at 10:28

SOFe

7,867
4
33
61

answered Jul 28 '16 at 09:07

tony

11
2

Design of DB table with many rows and many columns with binary information

3 Answers3

Linked