2

My team working on a php/MySQL website for a school project. I have a table of users with typical information (ID,first name, last name, etc). I also have a table of questions with sample data like below. For this simplified example, all the answers to the questions are numerical.

Table Questions:

qid | questionText
1   | 'favorite number'
2   | 'gpa'
3   | 'number of years doing ...'

etc.

Users will have the ability fill out a form to answer any or all of these questions. Note: users are not required to answer all of the questions and the questions themselves are subject to change in the future.

The answer table looks like this:

Table Answers:

uid | qid | value
 37 |  1  |  42
 37 |  2  |  3.5
 38 |  2  |  3.6

etc.

Now, I am working on the search page for the site. I would like the user to select what criteria they want to search on. I have something working, but I'm not sure it is efficient at all or if it will scale (not that these tables will ever be huge - like I said, it is a school project). For example, I might want to list all users whose favorite number is between 100 and 200 and whose GPA is above 2.0. Currently, I have a query builder that works (it creates a valid query that returns accurate results - as far as I can tell). A result of the query builder for this example would look like this:

SELECT u.ID, u.name (etc)
FROM User u
JOIN Answer a1 ON u.ID=a1.uid 
JOIN Answer a2 ON u.ID=a2.uid
WHERE 1
AND (a1.qid=1 AND a1.value>100 AND a1.value<200)
AND (a2.qid=2 AND a2.value>2.0)

I add the WHERE 1 so that in the for loops, I can just add " AND (...)". I realize I could drop the '1' and just use implode(and,array) and add the where if array is not empty, but I figured this is equivalent. If not, I can change that easy enough.

As you can see, I add a JOIN for every criteria the searcher asks for. This also allows me to order by a1.value ASC, or a2.value, etc.

First question: Is this table organization at least somewhat decent? We figured that since the number of questions is variable, and not every user answers every question, that something like this would be necessary.

Main question: Is the query way too inefficient? I imagine that it is not ideal to join the same table to itself up to maybe a dozen or two times (if we end up putting that many questions in). I did some searching and found these two posts which seem to kind of touch on what I'm looking for:

Mutiple criteria in 1 query

This uses multiple nested (correct term?) queries in EXISTS

Search for products with multiple criteria

One of the comments by youssef azari mentions using 'query 1' UNION 'query 2'

Would either of these perform better/make more sense for what I'm trying to do?

Bonus question:

I left out above for simplicity's sake, but I actually have 3 tables (for number valued questions, booleans, and text) The decision to have separate tables was because (as far as I could think of) it would either be that or have one big answers table with 3 value columns of different types, having 2 always empty.

This works with my current query builder - an example query would be

SELECT u.ID,...
FROM User u
JOIN AnswerBool b1 ON u.ID=b1.uid
JOIN AnswerNum n1 ON u.ID=n1.uid
JOIN AnswerText t1 ON u.ID=t1.uid 
WHERE 1
AND (b1.qid=1 AND b1.value=true)
AND (n1.qid=16 AND n1.value<999)
AND (t1.qid=23 AND t1.value LIKE '...')

With that in mind, what is the best way to get my results?

One final piece of context: I mentioned this is for a school project. While this is true, then eventual goal (it is an undergrad senior design project) is to have a department use our site for students creating teams for their senior design. For a rough estimate of size, every semester, the department would have somewhere around 200 or so students use our site to form teams. Obviously, when we're done, the department will (hopefully) check our site for security issues and other stuff they need to worry about (what with FERPA and all). We are trying to take into account all common security practices and scalablity concerns, but in the end, our code may be improved by others.

UPDATE As per nnichols suggestion, I put in a decent amount of data and ran some tests on different queries. I put around 250 users in the table, and about 2000 answers in each of the 3 tables. I found the links provided very informative

(links removed because I can't hyperlink more than twice yet) Links are in nnichols' response

as well as this one that I found:

http://phpmaster.com/using-explain-to-write-better-mysql-queries/

I tried 3 different types of queries, and in the end, the one I proposed worked the best.

First: using EXISTS

SELECT u.ID,...
FROM User u WHERE 1
AND EXISTS 
    (SELECT * FROM AnswerNumber 
    WHERE uid=u.ID AND qid=# AND value>#) -- or any condition on value
AND EXISTS
    (SELECT * FROM AnswerNumber
    WHERE uid=u.ID AND qid=another # AND some_condition(value))
AND EXISTS
    (SELECT * FROM AnswerText
...

I used 10 conditions on each of the 3 answer tables (resulting in 30 EXISTS)

Second: using IN - a very similar approach (maybe even exactly?) which yields the same results

SELECT u.ID,...
FROM User u WHERE 1
AND (u.ID) IN (SELECT uid FROM AnswerNumber WHERE qid=# AND ...)
...

again with 30 subqueries.

The third one I tried was the same as described above (using 30 JOINs)

The results of using EXPLAIN on the first two were as follows: (identical)

The primary query on table u had a type of ALL (bad, though users table is not huge) and rows searched was roughly twice the size of the user table (not sure why). Each other row in the output of EXPLAIN was a dependent query on the relevant answer table, with a type of eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall not bad.

For the query I suggested (JOINing):

The primary query was actually on whatever table you joined first (in my case AnswerBoolean) with type of ref (better than ALL). The number of rows searched was equal to the number of questions answered by anyone (as in 50 distinct questions have been answered by anyone) (which will be much less than the number of users). For each additional row in EXPLAIN output, it was a SIMPLE query with type eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall almost the same, but a smaller starting multiplier.

One final advantage to the JOIN method: it was the only one I could figure out how to order by various values (such as n1.value). Since the other two queries were using subqueries, I could not access the value of a specific subquery. Adding the order by clause did change the extra field in the first query to also have 'using temporary' (required, I believe, for order by's) and 'using filesort' (not sure how to avoid that). However, even with those slow-downs, the number of rows is still much less, and the other two (as far as I could get) cannot use order by.

Community
  • 1
  • 1
  • One detail I might should have mentioned: In all answer tables, the primary key is uid and qid. Users can answer a question or update the existing answer for the question. There will not be multiple entries for a question for a particular user. – David Ferretti Apr 06 '12 at 18:45
  • Is your question table dynamic or fixed? In other words will it expand or new questions will be added later? – rs. Apr 06 '12 at 19:02
  • New questions will be added or removed over time. – David Ferretti Apr 06 '12 at 19:05
  • Wow, if only I had known to call this (am I correct?) a has-many-through relation, I could have found this much sooner - http://stackoverflow.com/questions/7364969/how-to-filter-sql-results-in-a-has-many-through-relation The answer by @Erwin Brandstetter has a ton of info, including technique 4, 5, and 6 - which are the three I tested! He even mentions 4 and 5 being equivalent. – David Ferretti Apr 07 '12 at 16:33

1 Answers1

0

You could answer most of these questions yourself with a suitably large test dataset and the use of EXPLAIN and/or the profiler.

Your INNER JOINs will almost certainly perform better than switching to EXISTS but again this is easy to test with a suitable test dataset and EXPLAIN.

user1191247
  • 10,808
  • 2
  • 22
  • 32
  • Thanks for the input! I've seen EXPLAIN but never SHOW PROFILE. I am working on generating some junk data to test on, and then I will get right to the tests themselves. This type of search seems fairly common, so I was wondering if there was a standard way of performing the query (I pretty much made up the one I provided above, though I'm sure it's been done). I will test these and other methods I may come across for performance - I was just asking in case there was a tried-and-true best method for this type of query. – David Ferretti Apr 06 '12 at 19:29
  • I finally got around to running some tests - I'll put info in the question – David Ferretti Apr 07 '12 at 15:12
  • Well done for taking the time to post your findings. What were the query times for your three different queries? You might want to experiment with indices on the value fields to see if you can do away with filesort although with these small datasets it should not incur much of a performance penalty. I hope this proved to be a useful learning exercise. – user1191247 Apr 07 '12 at 17:06
  • I tried adding indices to the value fields, but the filesort remained. None of the queries (though they are small) had any noticeable performance difference with or without the index as well. Even though the two similar queries searched through 10 times as many rows, they both completed in about 20 ms, where the JOIN completed in around 40 ms. I think for our site, having MySQL do order by will be more convenient than some mix of further queries and/or sorting in php, despite the slight increase in time. – David Ferretti Apr 07 '12 at 17:43