
I have a membership database that I am looking to rebuild. Every member has one row in a main members table. From there I will use a JOIN to reference information from other tables. My question is: which of the following would be better for performance?

1 data table that specifies a data type and then the data. Example:

data_id | member_id | data_type | data
--------|-----------|-----------|-----------------
1       | 1         | email     | test@domain.com
2       | 1         | phone     | 1234567890
3       | 2         | email     | test@domain2.com

Or

Would it be better to make a table of all the email addresses, then a table of all the phone numbers, and so on, and then use a SELECT statement with multiple joins?

Keep in mind, this database will start with over 75,000 rows in the member table, and will actually include phone, email, fax, first and last name, company name, and address (city, state, zip). Each member will have at least one of each of those but can have multiple (normally 1-3 per member), so there will be in excess of 75,000 phone numbers, email addresses, etc.

So basically: join one table of more than 750,000 rows, or join 7-10 tables of more than 75,000 rows each?
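To make the comparison concrete, the two layouts being weighed could be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module with an in-memory database (the actual target is MySQL); the table and column names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Option 1: a single generic "data" table, one row per attribute value.
cur.execute("""
    CREATE TABLE member_data (
        data_id   INTEGER PRIMARY KEY,
        member_id INTEGER NOT NULL,
        data_type TEXT NOT NULL,      -- 'email', 'phone', 'fax', ...
        data      TEXT NOT NULL
    )""")

# Option 2: one table per attribute type.
cur.execute("""
    CREATE TABLE member_email (
        email_id  INTEGER PRIMARY KEY,
        member_id INTEGER NOT NULL,
        email     TEXT NOT NULL
    )""")
cur.execute("""
    CREATE TABLE member_phone (
        phone_id  INTEGER PRIMARY KEY,
        member_id INTEGER NOT NULL,
        phone     TEXT NOT NULL
    )""")

# The same sample rows from the question, stored both ways.
cur.executemany(
    "INSERT INTO member_data (member_id, data_type, data) VALUES (?, ?, ?)",
    [(1, "email", "test@domain.com"),
     (1, "phone", "1234567890"),
     (2, "email", "test@domain2.com")])
cur.executemany(
    "INSERT INTO member_email (member_id, email) VALUES (?, ?)",
    [(1, "test@domain.com"), (2, "test@domain2.com")])
cur.execute("INSERT INTO member_phone (member_id, phone) VALUES (1, '1234567890')")
conn.commit()
```

With option 1 every attribute of every member lands in one big table; with option 2 each attribute type gets its own, smaller table.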

edit: performance of this database becomes an issue when we insert sales data that needs to be matched to existing records in the database, for example taking a CSV file of 10,000 rows of sales and contact data and querying the database to find which member each sales row belongs to. Oh yeah, and this is done on a web server, not a local machine (not my choice).
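The CSV-matching workload described in the edit boils down to one indexed lookup per CSV row. A minimal sketch, again using SQLite in memory for illustration (the member and email data here are invented):

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE member (member_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE member_email (member_id INTEGER, email TEXT)")
# An index on the lookup column is what keeps per-row matching fast.
cur.execute("CREATE INDEX idx_member_email ON member_email (email)")
cur.executemany("INSERT INTO member VALUES (?, ?)",
                [(1, "Acme"), (2, "Widgets")])
cur.executemany("INSERT INTO member_email VALUES (?, ?)",
                [(1, "buyer@acme.com"),
                 (1, "sales@acme.com"),   # members can have several emails
                 (2, "info@widgets.com")])

# A stand-in for the 10k-row sales CSV.
sales_csv = "email,amount\nsales@acme.com,99.50\nnobody@nowhere.com,10.00\n"

matches = []
for row in csv.DictReader(io.StringIO(sales_csv)):
    cur.execute("SELECT member_id FROM member_email WHERE email = ?",
                (row["email"],))
    hit = cur.fetchone()
    matches.append((row["email"], hit[0] if hit else None))
# matches → [('sales@acme.com', 1), ('nobody@nowhere.com', None)]
```

Storing several known emails per member is exactly what lets the second, otherwise-unmatchable purchase address resolve to member 1 here.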

  • The following answer might be of interest. http://stackoverflow.com/questions/4419499/mysql-nosql-help-me-to-choose-the-right-one-on-a/4421601#4421601 – Jon Black Apr 12 '12 at 12:44

4 Answers


The obvious way to structure this would be to have one table with one column for each data item (email, phone, etc) you need to keep track of. If a particular data item can occur more than once per member, then it depends on the exact nature of the relationship between that item and the member: if the item can naturally occur a variable number of times, it would make sense to put these in a separate table with a foreign key to the member table. But if the data item can occur multiple times in a limited, fixed set of roles (say, home phone number and mobile phone number) then it makes more sense to make a distinct column in the member table for each of them.

If you run into performance problems with this design (personally, I don't think 75,000 rows is that much - it should not give problems if you have indexes that properly support your queries) then you can partition the data. MySQL supports native partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), which essentially distributes collections of rows over separate physical compartments (the partitions) while maintaining one logical compartment (the table). The obvious advantage here is that you can keep querying one logical table and do not need to manually gather the data from several places.

If you don't think that is an option, you could consider vertical partitioning: that is, making groups of columns, or even single columns, into their own table. This makes sense if you have some queries that always need one particular set of columns, and other queries that tend to use another set of columns. Only then would it make sense to apply vertical partitioning, because the join itself costs performance.
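A vertical-partitioning sketch of the idea above, using SQLite in memory for illustration (which columns count as "hot" vs. rarely needed is an assumption for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Columns needed by most queries stay in the main table...
cur.execute("""
    CREATE TABLE member (
        member_id  INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT
    )""")
# ...while rarely-used columns move to a 1:1 side table keyed by member_id.
cur.execute("""
    CREATE TABLE member_extra (
        member_id INTEGER PRIMARY KEY REFERENCES member,
        company   TEXT,
        fax       TEXT
    )""")
cur.execute("INSERT INTO member VALUES (1, 'Jane', 'Doe')")
cur.execute("INSERT INTO member_extra VALUES (1, 'Acme', '555-0100')")

# Queries that need both halves pay for exactly one extra join.
cur.execute("""
    SELECT m.first_name, e.company
    FROM member m
    JOIN member_extra e ON e.member_id = m.member_id""")
row = cur.fetchone()  # → ('Jane', 'Acme')
```

Queries that only touch `member` never read the wide columns at all, which is where this layout pays off.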

(If you're really running into the billions then you could consider sharding - that is, use separate database servers to keep a partition of the rows. This makes sense only if you can either quickly limit the number of shards that you need to query to find a particular member row or if you can efficiently query all shards in parallel. Personally it doesn't seem to me you are going to need this.)

I would strongly recommend against making a single "data" table. This would essentially spread out each thing that would naturally be a column to a row. This requires a whole bunch of joins and complicates writing of what otherwise would be a pretty straightforward query. Not only that, it also makes it virtually impossible to create proper, efficient indexes over your data. And on top of that it makes it very hard to apply constraints to your data (things like enforcing the data type and length of data items according to their type).

There are a few corner cases where such a design could make sense, but improving performance is not one of them. (See: entity attribute value antipattern http://karwin.blogspot.com/2009/05/eav-fail.html)
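To see why the generic "data" table complicates queries, compare what it takes to get one row per member with their email and phone. In the EAV layout each additional attribute costs another self-join; with real columns the same result is a plain single-table SELECT. A sketch (SQLite in memory, sample data from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE member_data (
        member_id INTEGER,
        data_type TEXT,
        data      TEXT
    )""")
cur.executemany("INSERT INTO member_data VALUES (?, ?, ?)",
                [(1, "email", "test@domain.com"),
                 (1, "phone", "1234567890")])

# EAV: one self-join per attribute just to pivot rows back into columns.
cur.execute("""
    SELECT e.member_id, e.data AS email, p.data AS phone
    FROM member_data e
    JOIN member_data p
      ON p.member_id = e.member_id AND p.data_type = 'phone'
    WHERE e.data_type = 'email'""")
eav_row = cur.fetchone()  # → (1, 'test@domain.com', '1234567890')
```

Ten attributes means ten self-joins here, and no per-attribute index or type/length constraint is possible, which is the answer's objection in miniature.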

Roland Bouman
  • Unfortunately putting all the data in the member table isn't an option, as each member has at least 1 of each data type but can conceivably have an infinite number of them (most will only have 1-3 but I have to leave open the possibility for more). So would having a query that joins an email address table and a phone number table and an address table etc. be better than just a single join of a data table, i.e. "select data from table where data_type = email or data_type = phone" etc.? – Spenser David Fishel Apr 12 '12 at 00:42
  • Of the 75000 member you have now, did you have any numbers on how many have which attribute (like email, fax, phone etc), and how many instances of that attribute (in case of multiple emails etc)? It may be tempting to take into account that what is theoretically "conceivable" but this is usually not how systems get built. If a member really has, say, 100 email addresses, how would you know which one to use? Or are you going to use all of them? – Roland Bouman Apr 12 '12 at 10:57
  • I made the same point when I first started this job, only to learn such sound logic is futile here. The issue is that higher-ups have determined that they want an infinite number of attributes available per member, just in case. They also like to add attributes and subtract them at any time (so one day we will collect phone numbers, the next day we won't, the next day we collect phone numbers of businesses but not individuals, etc). And yes, my company would in fact use all 100 email addresses if they were available (we allow companies to join and we will get multiple contacts for their employees) – Spenser David Fishel Apr 12 '12 at 13:51
  • 193,943 phone numbers, 119,868 contact names, 75,084 email addresses, 71,080 addresses, and 16,783 fax numbers for 74,398 members (growing at just a tick over 50 members per day on average; last year 35 per day). The other reason for the excess of contact information is for matching sales records to members (we sell other companies' products, so we get reports of purchases). If someone signs up with one email address but purchases with another, we can't match those, but if we have 3 of their email addresses, we may be able to match that – Spenser David Fishel Apr 12 '12 at 14:00
  • In that case I suggest having 1 table for each distinct type of data item rather than one big table for each and any kind of data item. So you'd get a member_email, member_phone, member_address etc. You'll need a lot of (outer) joins if you need to retrieve all data items for a particular member but at least you'll benefit from a smaller data set when you don't need all attributes (a kind of vertical partitioning) and you'll be able to define more efficient indexes than with a generic data item table. – Roland Bouman Apr 12 '12 at 18:27
  • I would still experiment also with vertical partitioning. You should do that in such a way that the attribute tables use the member key as part of their partitioning key such that you hit only one partition per attribute table for one distinct member. – Roland Bouman Apr 12 '12 at 18:29
  • i guess that answers my question then, thank you for your advice – Spenser David Fishel Apr 12 '12 at 21:12
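The per-attribute layout settled on in the comments above, with the outer joins Roland mentions, could look like this minimal sketch (SQLite in memory; names and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE member (member_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE member_email (member_id INTEGER, email TEXT)")
cur.execute("CREATE TABLE member_fax (member_id INTEGER, fax TEXT)")
cur.execute("INSERT INTO member VALUES (1, 'Jane Doe')")
cur.execute("INSERT INTO member_email VALUES (1, 'jane@example.com')")
# Jane has no fax number; LEFT JOINs still return her row, with NULL
# in place of the missing attribute.
cur.execute("""
    SELECT m.name, e.email, f.fax
    FROM member m
    LEFT JOIN member_email e ON e.member_id = m.member_id
    LEFT JOIN member_fax   f ON f.member_id = m.member_id
    WHERE m.member_id = 1""")
row = cur.fetchone()  # → ('Jane Doe', 'jane@example.com', None)
```

Queries that need only one attribute type (the common case when matching sales CSVs by email) touch only that one small table.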

You should research scaling out vs. scaling up when it comes to databases. In addition to the aforementioned research, I would recommend that you use one table in your case if you are not expecting a great deal of data. If you are, then look up dimensions in database design.

kasavbere

75k rows is really nothing for a DB. You might not even notice the benefits of indexes with that many (index anyway :)).

The point is that, though you should be aware of "scale-out" systems, most DBs, MySQL included, can address this through partitioning, allowing your data access code to stay truly declarative rather than programmatic about which object you're addressing or querying. Sharding vs. partitioning is an important distinction, but honestly those are conversations for when your record counts approach 9+ digits, not 5+.

Jé Queue

Use neither, although a variant of the first option is the right approach. Create a 'lookup' table that stores the possible values of data_type (mail, phone, etc.). Then use the id from your lookup table in your 'data' table. That way you actually have three tables instead of two. It's best practice for a classic many-to-many relationship such as this.
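The three-table variant this answer describes could be sketched as follows (SQLite in memory for illustration; table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Lookup table of attribute types; new types can be added or retired
# at runtime without schema changes.
cur.execute("""
    CREATE TABLE data_type (
        type_id INTEGER PRIMARY KEY,
        name    TEXT UNIQUE NOT NULL
    )""")
# Data table references the lookup by id instead of repeating the
# type name on every row.
cur.execute("""
    CREATE TABLE member_data (
        data_id   INTEGER PRIMARY KEY,
        member_id INTEGER NOT NULL,
        type_id   INTEGER NOT NULL REFERENCES data_type,
        data      TEXT NOT NULL
    )""")
cur.executemany("INSERT INTO data_type (name) VALUES (?)",
                [("email",), ("phone",)])
cur.execute("""
    INSERT INTO member_data (member_id, type_id, data)
    SELECT 1, type_id, 'test@domain.com' FROM data_type WHERE name = 'email'""")

# Reading back joins through the lookup to recover the type name.
cur.execute("""
    SELECT t.name, d.data
    FROM member_data d
    JOIN data_type t ON t.type_id = d.type_id
    WHERE d.member_id = 1""")
rows = cur.fetchall()  # → [('email', 'test@domain.com')]
```

This keeps the flexibility of the generic data table while constraining `type_id` to known values, though the indexing and pivoting caveats from the accepted answer still apply.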

Kudehinbu Oluwaponle