0

I was asked to use a database in which most of the primary keys, and other fields as well, uses char(n) to store numeric values with padding, for example:

product_id: char(8) [00005677]
user_id: char(6) [000043]
category_id: char(2) [05]

The reason they want to use it like that, is to be able to use characters (in the far future) if they want. However they have many rules based in numbers, for example, category_id from 01 to 79 correspond to a general category and from 80 to 89 is a special category and 90 to 99 is user defined category.

I personally think that using char(n) to store numbers is a bad practice. My reasons are:

  1. using char, " " != 0, 0 != 00, 05 != 5, 00043 != 000043, and so on. For that reason, the values have to be constantly checked (to prevent data corruption).
  2. If I pad a number: 0 -> 00, then I have to pay attention not to pad a character (A -> 0A)
  3. If characters are used, then ranges become strange, something like: from 01 to 79 and AB and RX and TZ and S, etc...
  4. Indexing numbers instead of chars result in a performance gain

I'm proposing to change it to decimal(n) with zerofill to make it more "error-proof", as this information is modified by different sources (web, windows client, upload csv). If they want to add more categories, for example, then updating from decimal(2) to decimal(3) will be easier.

My question then is: Am I wrong? can char(n) be trusted for this task? If "chars" are evil with numbers, then which other disadvantages am I missing in the above list (I may need better reasons if I want to win my case)?

TIA (any comment/answer will be appreciated).

lepe
  • 24,677
  • 9
  • 99
  • 108
  • I will accept the answer to whom can convince me that using char(n) that way is totally safe, or to whom can give me more reasons of why not using char. – lepe Aug 24 '11 at 03:14
  • 2
    @lepe: the case itself is absurd based on the "best practices". Leading zeros has nothing to do with storage, but only with presentation. As long as you want your ids to be shown with leading zeros - pad them **before** you output it to user, and store in any format you're comfortable working with. – zerkms Aug 24 '11 at 03:19
  • +1. I totally agree with you... IMHO using chars suddenly where normally only numbers are used may break things in the future. The only reason is that they want to keep in mind the number of chars, (this is: 99, AA (2 chars); 100 (3 chars)). But I think it does not worth the complication. It is just easier to increment the number if it is necessary... but explain to them! :S – lepe Aug 24 '11 at 03:30
  • char is effectively padded on the right with spaces for all comparisons in ANSI SQL. Not sure if MySQL has an issue there. But in your case varchar or char would make no difference because all these identifiers should have their length equal to the capacity of the column because they should all be padded on the left with '0'. You haven't mentioned whether in the future all codes are supposed to not contain spaces and be the full length? – Cade Roux Aug 24 '11 at 03:45
  • Yes, I didn't mentioned that because it is unclear now. Until now, only numbers are used (we are talking about IDs) and all must be of full length (padded with zeros). However, if they wish to use characters as well, I'm not sure if spaces will be allowed... probably, which may complicate things even more. The only reason of using characters is to have more IDs of the same length. That is why I see the use of char(n) as a nonsense limitation. If they want to add more IDs, then just increase the number, that is how IMHO should be. – lepe Aug 24 '11 at 03:57
  • I couldn't convince the customer... :( so I will have to apply some of the recommendations posted here. I'm marking Cade's answer as the recommendations given by gahooa were already applied (+1 for your help). Also Cade's posted others ways to deal with it... Thank you! – lepe Aug 24 '11 at 07:33

2 Answers2

1

Quoting your question:

...store numeric values with padding...

You did not show any examples of numeric data, only character data that happens to consist of numbers. If you had said that their OrderTotal column was a char(10), then I'd start to worry.

Just treat this as character data and you will be fine. I can see no business or technical case to change the database (unless you are beginning a near-total rewrite).

Regarding performance... If this is actually a concern, then you most likely have far bigger issues to deal with. MySQL is fast and accurate.

--

Write a function somewhere that will zerofill user inputted ID's for the purpose of querying. Use this function everywhere you need to accept user input. NEVER EVER use a numeric data type to store your data (if PHP, never use +, always use . to concat, etc...)

Remember, this is no different than Item_Number = "SHIRT123" or any other string ID you may encounter.

Take care

gahooa
  • 131,293
  • 12
  • 98
  • 101
  • Yes, there are other fields (like OrderTotal) in which int is used... no problem there. The database itself is very old (maybe 8 years old), but the system was rewritten totally from zero. Both, the web and the windows client. The system is not launched yet, so there is still time to propose some changes. My main concern is data integrity and reduce the possibility of bugs as it will be used by hundreds of branches. – lepe Aug 24 '11 at 03:09
  • "Just treat this as character data and you will be fine. I can see no business or technical case to change the database": The problem resides that I'm not the only developer and if someone miss-to-check the padding (when is a number) then you are adding corrupted data. As Cade commented, I would like to enforce it at the database (as the data is modified in different clients) if possible. – lepe Aug 24 '11 at 03:39
1

If this was SQL Server or Oracle or any other RDBMS, I would recommend enforcing a check constraint on those columns so that the data always matched the full capacity of the column - this would ensure your identifiers are uniform.

Unfortunately MySQL doesn't support this.

While it wouldn't stop the annoyance of having to pad things coming into the database or in search routines, on the client or in procs in the database, it would guarantee you that the fields were clean at the lowest level.

I find using constraints like this help avoid things getting badly out of hand.

As far as the optimization by using numbers, if they have to accommodate non-numeric characters in the future, that's not going to be an option.

It is very common to have natural keys (which could be candidates for a primary key) with varchar/char data, but yet instead enforce referential integrity on surrogate keys (usually some kind of autonumbering integer which is simply an internal reference, and often the clustered index and primary key).

Community
  • 1
  • 1
Cade Roux
  • 88,164
  • 40
  • 182
  • 265
  • Yes, using triggers to pad the data could be an option... CHECK could do my life easier, I guess... but I'm stuck with MySQL. So, basically you said that without those CHECKs, your data could get badly out of hand? A long time ago, I proposed to use the internal AUTOCOUNT reference but they refuse the idea as part of the data is imported directly from each of the windows clients, and they think it would just complicate it. – lepe Aug 24 '11 at 03:24
  • 1
    @lepe If you can't enforce it at the database, anything can happen. I always want to know what the services/integrity the database will guarantee at the perimeter it exposes to consumers (tables/views or procs). To mitigate not having it as a declarative constraint, you could have an hourly/daily process which checks for potential problems and nip them in the bud. It would be relatively easy to write/code generate and keep updated using the database metadata and give you some better confidence. – Cade Roux Aug 24 '11 at 03:26
  • +1 Exactly, "if you can't enforce it at the database, anything can happen": That is the idea that I need to extend. Running a cron to check the data is also a nice idea in the case I can't convince them, , i will think about it. – lepe Aug 24 '11 at 03:35
  • @lepe In complex systems, there are constraints that are simply unrealistic to implement at the database level. Not just because they are higher-order business logic, but because it isn't worth figuring out where to put the triggers - or perhaps the trigger would need to apply to a wide variety of interrelated tables. In this case, automated exception reporting which can covers an entire data model is your friend. – Cade Roux Aug 24 '11 at 03:42