-1

I am designing a quiz app and i am stuck on how to design the answers table.

Assume i have the following tables:

User(user_id,...other columns)
Question(question_id,user_id,...other columns)
QuestionAnswers(question_id,answer_id... other columns)

Now what to do about the UserAnswers table? The structure I have in mind is this:

UserAnswers(user_id,question_id,answer_id,.. other columns)

The structure that I have made works great at start, but the performance starts degrading once I reach 10 million rows. Considering my app, if 10,000 questions are present, and there are 1000 users in the system and each user answers each of the question. I will easily get to 10 million rows, and as users and questions grow the table size will grow dramatically.

What is a better approach to store these answers?

Moreover, I designed the system in MySQL. Do you think the same table structure would work better in some other DBMS?

mysql> explain select count(*) from user_answer where question_id = 9845;
+----+-------------+-------------+------------+------+---------------+-------------+---------+-------+------+----------+-------------+
| id | select_type | table       | partitions | type | possible_keys | key         | key_len | ref   | rows | filtered | Extra       |
+----+-------------+-------------+------------+------+---------------+-------------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | user_answer | NULL       | ref  | question_id   | question_id | 4       | const |  645 |   100.00 | Using index |
+----+-------------+-------------+------------+------+---------------+-------------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)


mysql> explain select count(*) from user_answer;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                        |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+------------------------------+
|  1 | SIMPLE      | NULL  | NULL       | NULL | NULL          | NULL | NULL    | NULL | NULL |     NULL | Select tables optimized away |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+------------------------------+
1 row in set, 1 warning (0.00 sec)


mysql> select count(*) from user_answer;
+----------+
| count(*) |
+----------+
| 20042126 |
+----------+
1 row in set (11 min 30.33 sec)
Eric B.
  • 4,622
  • 2
  • 18
  • 33
  • 1
    While this question might not be a fit for StackOverflow, I'm intrigued to read the answers as this is quite a practical question – Ben Jul 05 '16 at 13:42
  • @Ben I know it should be moved to http://dba.stackexchange.com/ but i want a larger audience. – Eric B. Jul 05 '16 at 13:43
  • 1
    @EricB. 10 million is not a very big value fro mysql in production we have 250 records in a single mysql table but its performing great . you have to index your table according to queries – Mahesh Madushanka Jul 05 '16 at 13:46
  • @MaheshMadushanka, `user_id,question_id` is the primary key. I think this is a good index, because i would only be selecting using a where clause for the question_id or user_id. Do you have any better index in mind? I read about sharding, but i don't want to move to a multiple server architecture for such a simple table. – Eric B. Jul 05 '16 at 13:48
  • can you execute explain statement and check. yes now we are planing to move in to shading becous after 200 million records its showing some small performance issues – Mahesh Madushanka Jul 05 '16 at 13:51
  • question_id in the composite primary is fine. In so far as you are going after the user_id too in the query. If not, as question_id is not left-most, the index is not used. See this [answer](http://stackoverflow.com/a/32620163) for the junction table `SCJunction` where I flipped the ordering for the second composite key `key (courseId,studentId)` to pick up other queries "going the other way". It all depends on your actual queries. Spend time running your queries thru explain like [here](http://stackoverflow.com/a/38189113) and profile the results. Test new ideas. Ask for help with `Explain` – Drew Jul 05 '16 at 14:09
  • @MaheshMadushanka I have edited my question. Please see it. – Eric B. Jul 05 '16 at 14:29
  • @Drew, what you mean to say is that moving question_id column to the left would result in faster queries? – Eric B. Jul 05 '16 at 14:31
  • I am saying it depends on your particular query at the moment. And just taking on `b` to a key like `key(a,b)` does not make the key used if only going after `b`. Which is why I showed the `SCJunction` because there are other queries that want just `b`. The key takeway is "what is left-most". – Drew Jul 05 '16 at 14:35
  • @Drew I got your point now. The count(*) query takes a lot of time, how to optimize it given the above table structure? You can see in the question that the count query takes a lot of time. – Eric B. Jul 05 '16 at 14:39
  • You may have a deadlock, see [this](http://stackoverflow.com/a/918092) – Drew Jul 05 '16 at 14:55
  • 1
    @Drew amazing! This solved the count issue, and may be the solution I am looking for. I don't need to show my user the latest answer count, I can afford if my data doesn't have the latest 10 records. – Eric B. Jul 05 '16 at 14:59

1 Answers1

2

A general indexing concept is what is left-most in a key. Let's take for example the following key (whether or not it is primary is not the focus here)

key(a,b,c)

For queries such as

select region from myTable where c='Turkey'

the above key is not used. You may endure a tablescan.

For queries such as

select region from myTable where a=17 and c='Turkey'

The key is used up to the most left-most part used, which is a, as b is not in the query. So the key is useful, but not entirely useful. Meaning, at least it quickly gets you to the segmented a rows, but performs a where from there.

Let me say the above another way: In that query, it does not fully use the index to get to c. It knows b is not in the mix of the query, and does not magically jump over b to get to c while fully using the index. But at least the index is partially used.

This is why on thin index width such as ints, and with composites, I often create the second composite index "going the other way", as shown in this answer for Junction tables:

unique key(studentId,courseId,term), -- no duplicates allowed for the combo (note student can re-take it next term)
key (courseId,studentId),

ignore term in there for this discussion. Point being, those are thin ints (relatively low overhead). The second key will require overhead. So it comes at a cost, a cost I am willing to pay. But for queries going in the other direction, I am covered. Meaning, queries involving courseId without the studentId.

Note, my composite flip in the above is not a stellar one. It has often been pointed out to me that having it as shown causes unnecessary overhead. In particular, for the 2nd key, it should just be on courseId (a non-composite). If on the first key, for whatever reason I had term wedged into 2nd place, then it would be a valid example.

A better example would be

key (a,b,c),
key (c,b)

The above, among other things, would be useful for queries going against just c, and also b and c together. But not just b.

The Takeaway:

Resist the urge to splatter new indexes into your schema foolishly thinking they will be used. Especially for non-left-most columns not picked up in actual and frequent queries. And certainly not for those just mentioned and wider columns like varchar(100) times several flips in ordering in multiple indexes. All they do potentially is slow down the inserts and updates and offer, many times, zero performance gains in actual queries. So scrutinize it all.

All index choices come at a cost. Only you should make that determination for what is right for your system.

Community
  • 1
  • 1
Drew
  • 24,851
  • 10
  • 43
  • 78