MySQL performance issue on ~3million rows containing MEDIUMTEXT?

Question

I had a table with 3 columns and 3600K rows. Using MySQL as a key-value store.

The first column id was VARCHAR(8) and set to primary key.The 2nd and 3rd columns were MEDIUMTEXT. When calling SELECT * FROM table WHERE id=00000 MySQL took like 54 sec ~ 3 minutes.

For testing I created a table containing VARCHAR(8)-VARCHAR(5)-VARCHAR(5) where data casually generated from numpy.random.randint. SELECT takes 3 sec without primary key. Same random data with VARCHAR(8)-MEDIUMTEXT-MEDIUMTEXT, the time cost by SELECT was 15 sec without primary key.(note: in second test, 2nd and 3rd column actually contained very short text like '65535', but created as MEDIUMTEXT)

My question is: how can I achieve similar performance on my real data? (or, is it impossible?)

I'm sorry it's 2min 54sec, not 54sec. Time cost on real data is 3min (timed by mysql and python `time.clock()`), on test data it is 15sec. — Nriuam, Oct 31 '16 at 08:12
Are you using an index on your 'data' column in your test table? Or is it only primary key there? I'm pretty sure your bottleneck is disk I/O and sub-optimal use of varchar as primary key — Wintermute, Oct 31 '16 at 08:15
Like @Imanez said, why use varchar on your id? Searching on an indexed integer column is much faster. — Laurent Meganck, Oct 31 '16 at 08:16
@Imanez should I use numeric int or TINYTEXT? I'm confused by the test, which used VARCHAR but much faster than in real scene — Nriuam, Oct 31 '16 at 08:17
@Stan McGeek no, test table had no primary key or any index. So making VARCHAR primary key could be worse than no index?Maybe I need to read some doc... completely new to MySQL/databases — Nriuam, Oct 31 '16 at 08:22
What is the **REAL** data types that you need for all three columns? Also, is the key column unique or you expect more than one row with the same key value? Though it is true that integer processing is faster than string, the fact that your select looks only at the key field, implies that the actual search takes place ONLY on the key field (i.e. index). As such, 2m 54secs looks to me an **eternity**. — FDavidov, Oct 31 '16 at 08:23
@Laurent Meganck got it. Came from python and just arbitrary used VARCHAR... my fault — Nriuam, Oct 31 '16 at 08:24
@FDavidov I mean REAL data contains string range from 0 length to more than 65535, while test data uses short random strings. The key column is unique. Maybe the bottleneck is as Stan McGeek has said. I'll run a test and post the result below — Nriuam, Oct 31 '16 at 08:43
@XiaowenFeng, So the length of your key field might be up to 65535 chars... That would be a very, **VERY** bad selection for a key and hence our performance issues. Still, if this is what you need, you will have to live with that type of performance. Here I assume that "real data" means the one used as a key (first column). — FDavidov, Oct 31 '16 at 08:46
@FDavidov the real data use unique ID as key; test data assigned no primary key(yes the first column, or ID, in **test** data is redundant, apologize for confusing. I should pick a better naming). I use ID as string in my python part so set to a non-numeric by fault:( at retrying/testing part now — Nriuam, Oct 31 '16 at 08:53
Dear @XiaowenFeng, how you use it in you python is irrelevant. Please answer this: Is the key field (column 1) of NUMERIC ONLY NATURE or it may include letters, signs, blanks, etc.? — FDavidov, Oct 31 '16 at 08:55
@XiaowenFeng it's quite a complex topic if you want to actually know underlying differences between datatypes and their indexing schemes and it's not really the point here. You should try matching your data types to your actual data. First of all your primary key should be an integer, I'd say ~99% of the time as primary key is used to order rows for storage. Plus fetching data by indexed integer equivalnce is few magnitudes faster than pretty much anything else. As always, it's try->benchmark->repeat anyways, so just try following those tips from the comments and compare results — Wintermute, Oct 31 '16 at 09:02
@FDavidov They are numeric integer and do not starts with zero(s) — Nriuam, Oct 31 '16 at 09:06
@Stan McGeek ok I got the point. btw while googling I came across [this post](http://stackoverflow.com/questions/332300/is-there-a-real-performance-difference-between-int-and-varchar-primary-keys) asking about performance between VARCHAR/INT as primary key, and in Timothy Khouri's answer it seems they can be similar in some case, Paul Tomblin's also gives a similar idea. — Nriuam, Oct 31 '16 at 09:30
hard to tell without actual benchmark data, but seems legit. If your varchars are small enough and you allocate enough memory for index buffers, you shouldn't see any difference between varchar and integer on select. You will see the difference though on insert, as there's ordering operation per each insert, which will be negligible when using incrementing integers, but quite big otherwise. There's a really good point made by @Solarflare in the answers section: aren't you forgetting quotes in your WHERE clause? If you are, then here's your problem — Wintermute, Oct 31 '16 at 09:44
@Stan McGeek It turns out Solarflare's point was completely right. — Nriuam, Nov 01 '16 at 03:04

score 0 · Answer 1 · answered Oct 31 '16 at 08:18

0

If your id column contains only numbers so define it as int , because int will give you better performance ( it is more faster)

answered Oct 31 '16 at 08:18

Imanez

500
5
13

score 0 · Answer 2 · answered Oct 31 '16 at 09:11

Make the column in your table (the one defined as key) integer and retry. Check first performance by running a test within your DB (workbench or simple command line). You should get a better result. Then, and only if needed (I doubt it though), modify your python to convert from integer to string (and/or vise-versa) when referencing the key column.

Solarflare · Accepted Answer · 2016-10-31T09:41:26.937

If you use

SELECT * FROM `table` WHERE id=00000

instead of

SELECT * FROM `table` WHERE id='00000'

you are looking for all strings that are equal to an integer 0, so MySQL will have to check all rows, because '0', '0000' and even ' 0' will all be casted to integer 0. So your primary key on id will not help and you will end up with a slow full table. Even if you don't store values that way, MySQL doesn't know that.

The best option is, as all comments and answers pointed out, to change the datatype to int:

alter table `table` modify id int;

This will only work if your ids casted as integer are unique (so you don't have e.g. '0' and '00' in your table).

If you have any foreign keys that references id, you have to drop them first and, before recreating them, change the datatype in the other columns too.

If you have a known format you are storing your values (e.g. no zeros, or filled with 0s up to the length of 8), the second best option is to use this exact format to do your query, and include the ' to not cast it to integer. If you e.g. always fill 0 to 8 digits, use

SELECT * FROM `table` WHERE id='00000000';

If you never add any zeros, still add the ':

SELECT * FROM `table` WHERE id='0';

With both options, MySQL can use your primary key and you will get your result in milliseconds.

Yes I forgot the quotes!! Thank you so much for pointing it out. — Nriuam, Nov 01 '16 at 03:06

MySQL performance issue on ~3million rows containing MEDIUMTEXT?

3 Answers3