
I'm building a system that has the potential to get hammered pretty hard with hits and traffic. It's a typical Apache/PHP/MySQL setup.

I have built plenty of systems before, but never had a scenario where I really had to make decisions about scalability of this size. I have dozens of questions about building a system of this magnitude, but for this particular question, I am trying to decide what to use as the data type.

Here is the 100ft view:

We have a table which (among other things) has a description field. We have decided to limit it to 255 characters. It will be searchable (i.e. show me all entries whose description contains ...). Problem: this table is likely to have millions upon millions of entries at some point (or so we think).

I have not yet figured out the strategy for the search (the MySQL LIKE operator is likely to be slow and/or a resource hog for such a large number of records, I am guessing), but that's for another SO question. For this question, I am wondering what the pros and cons are of creating this field as a TINYTEXT, VARCHAR, or CHAR.
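For concreteness, a minimal sketch of the kind of table being described (the table and column names here are illustrative assumptions, not from the question):

```sql
-- Hypothetical schema for the scenario above.
CREATE TABLE entries (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    description VARCHAR(255) NOT NULL,  -- the 255-char field in question
    -- ... other columns ...
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- The "contains" search being worried about: a leading-wildcard LIKE,
-- which cannot use a B-tree index and forces a full table scan.
SELECT id FROM entries
WHERE description LIKE '%some phrase%';
```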

I am not a database expert, so any and all commentary is helpful. Thanks -

  • Since it looks like the question has been edited to definitively include the problem of searching the text field, you may want to edit the title to reflect that. – TehShrike Sep 05 '11 at 00:21
  • @tehshrike: I didn't edit it. My question still remains the same (pros and cons of each). It just looks like everyone jumped on the 'search' piece. Apparently, lots of people are interested in the difference between each kind of data field. Based on everything I read, it sounds like it's a wash for search (just use whatever you want, then throw on an indexing software component when you need it). Still waiting for someone to break down the pros and cons of each though. – OneNerd Sep 05 '11 at 14:02

4 Answers


Use a CHAR.

BLOBs and TEXTs are stored outside the row, so there will be an access penalty for reading them. VARCHARs are variable length, which saves storage space but could introduce a small access penalty (since the rows aren't all fixed length).

If you create your index properly, however, either VARCHAR or CHAR can be stored entirely in the index, which will make access a lot faster.
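A rough sketch of that point, reusing the hypothetical `entries`/`description` names (assumptions, not from the question):

```sql
-- With the column in a secondary index, a query that touches only
-- indexed columns can be satisfied from the index alone
-- (a "covering" index), avoiding row lookups entirely.
CREATE INDEX idx_description ON entries (description);

-- A prefix search like this can use idx_description, and since only
-- `description` is selected it may be resolved from the index itself:
SELECT description FROM entries
WHERE description LIKE 'widget%';
```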

See: varchar(255) v tinyblob v tinytext
And: http://213.136.52.31/mysql/540
And: http://forums.mysql.com/read.php?10,254231,254231#msg-254231
And: http://forums.mysql.com/read.php?20,223006,223683#msg-223683

Incidentally, in my experience the MySQL regex operator (REGEXP) is a lot faster than LIKE for simple queries (e.g., SELECT id FROM some_table WHERE some_column REGEXP 'search.*'), and obviously more versatile.

Seth
  • Thanks, researching the 4 articles now. Up-voted. Not sure which answer to accept yet though -- lots to dig through. – OneNerd Sep 03 '11 at 20:03
  • Some good information there, but none of it is quite relevant to your problem - it doesn't matter where it's stored, or if it's fixed-width or not - if you have millions of rows, you can't be using table scans. You'll need indexes, and normal indexes on ANY text field won't let you search for text in the middle of the field. – TehShrike Sep 03 '11 at 20:32
  • @tehshrike: Well, my question was not about search, it was about the pro's and con's of each data type. Like I mentioned in my question, the 'search' piece will be a separate question. I actually thought the info he gave was good. – OneNerd Sep 03 '11 at 20:49
  • @OneNerd "but thats for another SO question" - so you did! My mistake. It's such a big question, that's what got focused on. I would highly recommend checking out this book: http://mo4.us/IjO – TehShrike Sep 03 '11 at 21:48
  • @TehShrike - Actually, normal (b-tree) indexes in `CHAR` or `VARCHAR` can be used for `LIKE` searches: "The index also can be used for LIKE comparisons if the argument to LIKE is a constant string that does not start with a wildcard character." (from [here](http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html)). Should work fine for millions of rows if you have enough memory (maybe not 100s of millions though). – Seth Sep 04 '11 at 19:16
  • @Seth: It's true, as long as you are searching the beginning of the field, an index can be used. But that is not what OneNerd said he would be doing: "show me all entries with description that contains ..." – TehShrike Sep 04 '11 at 23:53
  • Oh, and as an aside, you can index TEXT fields as well, though you are required to specify a prefix length (capped at 1000 bytes, or 767 for InnoDB) http://dev.mysql.com/doc/refman/5.5/en/blob.html http://dev.mysql.com/doc/refman/5.5/en/column-indexes.html – TehShrike Sep 05 '11 at 00:17
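The prefix-index point from the comments above can be sketched as follows (table and column names are assumptions):

```sql
-- Indexing a TEXT column requires an explicit prefix length;
-- here only the first 100 characters are indexed.
CREATE TABLE notes (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    body TEXT NOT NULL,
    PRIMARY KEY (id),
    INDEX idx_body (body(100))
) ENGINE=InnoDB;

-- Can use the prefix index (constant string, no leading wildcard):
SELECT id FROM notes WHERE body LIKE 'error 42%';

-- Cannot use it (leading wildcard): falls back to a full scan.
SELECT id FROM notes WHERE body LIKE '%error 42%';
```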

I believe with varchar you've got a variable length stored in the actual database at the low level, which means it could take less disk space; with the text field it's fixed length even if a row doesn't use all of it. A fixed-length string should be faster to query.

Edit: I just looked it up - text types are stored as variable length as well. The best thing to do would be to benchmark it with something like mysqlslap.

In regards to your other un-asked question: you'd probably want to build some sort of search index that ties every useful word in the description field individually to a description; then you can index that and search it instead. It will be way faster than using %like%.
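The word-to-description mapping described above is essentially a hand-rolled inverted index; a minimal sketch (table and column names are assumptions, and this presumes an `entries` table holding the descriptions):

```sql
-- One row per (word, entry) pair; the composite primary key doubles
-- as the lookup index.
CREATE TABLE description_words (
    word     VARCHAR(64)  NOT NULL,
    entry_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (word, entry_id)
) ENGINE=InnoDB;

-- A "contains" search becomes an indexed equality lookup plus a join,
-- instead of a LIKE '%...%' table scan:
SELECT e.*
FROM description_words w
JOIN entries e ON e.id = w.entry_id
WHERE w.word = 'widget';
```

The trade-off is that the word table must be maintained (split each description into words) on every insert and update.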

profitphp
  • I was checking out http://forums.mysql.com/read.php?24,105964,105964 and http://forums.mysql.com/read.php?10,254231,254581#msg-254581 - seems like varchar is the winner up to 255 characters for a variety of reasons. – VNO Sep 03 '11 at 18:55
  • Nice, looks like the benchmarking has already been done, so there you go. – profitphp Sep 03 '11 at 18:58
  • Thanks, researching those articles now. Up-voted. Not sure which answer to accept yet though -- lots to dig through. – OneNerd Sep 03 '11 at 20:03

In your situation all three types are bad if you'll use LIKE (a LIKE '%string%' won't use any index created on that column, regardless of its type). Everything else is just noise.

I am not aware of any major difference between TINYTEXT and VARCHAR up to 255 chars, and CHAR is just not meant for variable length strings.

So my suggestion: pick VARCHAR or TINYTEXT (I'd personally go for VARCHAR) and index the content of that column using a full-text search engine like Lucene, Sphinx or any other that does the job for you. Just forget about LIKE (even if that means you need to custom-build the full-text search index engine yourself for whatever reasons you might have, e.g. you need support for a set of features that no engine out there can satisfy).
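As a rough illustration of the query shape such engines give you instead of LIKE, here is MySQL's own built-in full-text support (MyISAM-only in the MySQL 5.1/5.5 era this discussion dates from; external engines like Sphinx or Lucene expose similar "match" semantics with better features). Table and column names are assumptions:

```sql
-- Built-in full-text index; note ENGINE=MyISAM, since InnoDB did not
-- support FULLTEXT indexes at the time.
CREATE TABLE entries_ft (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    description VARCHAR(255) NOT NULL,
    PRIMARY KEY (id),
    FULLTEXT KEY ft_description (description)
) ENGINE=MyISAM;

-- Word-based "contains" search using the full-text index:
SELECT id FROM entries_ft
WHERE MATCH(description) AGAINST('widget gadget');
```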

Marius Burz
  • It's true, all of them are equally bad if you're looking at table scans. The old MyISAM engine does support fulltext indexing, but it's not a very useful implementation in my experience. – TehShrike Sep 03 '11 at 20:35
  • I didn't think for a second about using the one supported by MyISAM(not a fan of a primitive engine like this either). I was thinking about stuff like Lucene or Sphinx or really making a custom full text index if it makes sense. It's true, my answer was confusing, but I updated it to address this issue. – Marius Burz Sep 03 '11 at 20:45

If you want to search among millions of rows, store all these texts in a different table (which will decrease the row size of your big table), and use VARCHAR if your text data is short, or TEXT if you require greater length.

Instead of searching with LIKE use a specialized solution like Lucene, Sphinx or Solr. I don't remember which, but at least one of them can be easily configured for real-time or near real-time indexing.

EDIT

My proposition of storing the text in a different table reduces the IO required for the main table, but when data is inserted it requires maintaining an additional index, and it adds join overhead to selects; so it is only valid if you use your table to read a few descriptions at once while the other data in the table is used more often.
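A minimal sketch of that split (all table and column names here are illustrative assumptions):

```sql
-- Narrow, frequently-read columns stay in the main table.
CREATE TABLE items (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    -- ... other small columns ...
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- The wide text moves to a side table keyed 1:1 by the main id,
-- so scans of `items` touch fewer pages.
CREATE TABLE item_descriptions (
    item_id     INT UNSIGNED NOT NULL,
    description VARCHAR(255) NOT NULL,
    PRIMARY KEY (item_id)
) ENGINE=InnoDB;

-- Reading a description now costs a join:
SELECT i.id, d.description
FROM items i
JOIN item_descriptions d ON d.item_id = i.id
WHERE i.id = 42;
```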

piotrp
  • A single column of up to `255` chars (meaning: probably most of the time it will be around 100 chars or shorter) should not be a problem, so I don't think it really makes sense to store this column in an external table. – Marius Burz Sep 03 '11 at 19:39
  • Depends on your data size - if you have less data in a row you can read it faster from disk and use cache more efficiently. 100 bytes less in 1 million rows is 100 MB less to read during a table scan = ~6400 fewer pages in a table (assuming standard InnoDB 16 KB per page). That's a cheap optimization that may pay off in the future. – piotrp Sep 03 '11 at 20:11
  • So let's build a table for every `varchar` column longer than 50 bytes? The problem I see is: why would you accept and do nothing about a full table scan in a performance critical situation? I don't strive to make a full table scan faster, I strive not to get into situations where full table scans are required. Obviously, it's not possible at all times, but where it matters it's a must in my books. – Marius Burz Sep 03 '11 at 20:16
  • A table scan is NOT the way to go anyway. – Your Common Sense Sep 03 '11 at 20:18
  • I used a table scan only as an example in which I can easily calculate the advantages. @Mariusz - not every `varchar`, but if the table is to have millions of entries then something like a description can easily be stored somewhere else, reducing row size by e.g. 50% (and allowing more rows per disk page). – piotrp Sep 03 '11 at 20:49
  • @Crack How about the extra IO when storing that column in a different table and selecting that column, or filtering based on it without hitting an index built on top of it, or updating it, all while also involving other columns which are stored in the main table? It could end up pretty nasty with a lot of extra IO. Your suggestion might be correct in some cases, but the decision depends on too many things to generalize it, and without proof that it's beneficial on a per-case basis I'd just not use it. – Marius Burz Sep 03 '11 at 21:30
  • Well, the same can be said about other advice given for this question. I assumed that something called 'description' will be used seldom enough to warrant this separation; others give advice that works for simple OLTP use-cases. – piotrp Sep 03 '11 at 21:56
  • Indeed, unfortunately there is no silver bullet, everything must be considered on a case basis. – Marius Burz Sep 03 '11 at 22:19
  • I think @Marius Burz might be overestimating the cost of joining on another table, but at the same time, I don't think there is much advantage to be gained by putting this column in another table. If you use the TEXT type, there is a good chance the data is already stored in another physical location (depending on the storage engine). – TehShrike Sep 05 '11 at 00:30
  • @TehShrike There's definitely a higher IO cost associated with joining than by just using a single table. – Marius Burz Sep 05 '11 at 00:33