0

I am building a dynamic MySQL user search query that is to be able to handle many different search criteria. I considered writing a stored procedure, but ended up building the query in the client (prepared statement in PHP). Among the criteria is to be able to search for users' ages, i.e. between X and Y years old. I am wondering how to do this as efficiently as possible. The end query will be fairly complex and have several joins and may be run on a few million rows in the future, so I need to optimize it where I can. I am storing a user's date of birth in an indexed DATE column that has the YYYY-MM-DD format. I have the following User Defined Function (UDF) for calculating a user's age:

RETURN (DATE_FORMAT(current_time, '%Y') - DATE_FORMAT(date_of_birth, '%Y') - (DATE_FORMAT(current_time, '00-%m-%d') < DATE_FORMAT(date_of_birth, '00-%m-%d')));

The details of the calculation are not important; I am more concerned with how it is used. One of my worries is that using this UDF in my WHERE clause would slow the query down significantly, because it needs to be run on every single row, even though I made the UDF deterministic. I cannot guarantee that there will be other criteria to narrow the matching rows down before checking ages. I can't just check the date of birth against a date, because that would not be accurate. I am thinking whether pulling the above calculation out of the UDF and embedding it directly within the query's WHERE clause would be make a noticeable difference (I think yes). The downside would then be that the WHERE clause is further complicated with such a calculation (or actually two, unless there is a way to reuse the result). But I guess there is no way to avoid those calculations. Is performing this calculation in the WHERE clause the way to go in regards to performance, or is there a better way?

Theoretically, I guess I could even add an age column in the user table and calculate the age when a user registers and run a scheduled job/cronjob every night to update ages for users that have a birthday of today (if I can select that efficiently). This would surely speed up my search query, but would introduce redundant data. Therefore I would really only want to do this if the calculation cannot be done efficiently within the search query itself.

So, to sum up: I need to search for users within a range of ages (e.g. 25 to 30). Should I calculate the age in the WHERE clause, or would that be very slow because it has to be done on every row? Is it a sacrifice I have to make, or do I have better alternatives?

Any help is much appreciated.

ba0708
  • 10,180
  • 13
  • 67
  • 99

2 Answers2

4

If you want an accurate age calculation based on the current date, then you should try something like:

where date_of_birth between date(now()) - interval 30 years and date(now()) - interval 25 year

In this case, you are not doing any transformation of date_of_birth, so an index can be used for the query.

Also, you shouldn't use expressions like:

DATE_FORMAT(current_time, '%Y') - DATE_FORMAT(date_of_birth, '%Y')

DATE_FORMAT() converts the argument to a string. You want a number, so just use:

year(now()) - year(date_of_birth)

It saves the conversion from a date to a string to an int and just goes directly to an int.

EDIT:

To handle the case of "25" really meaning "up to 26", implement the logic with explicit comparisons:

where date_of_birth >= date(now()) - interval 30 years and
      date_of_birth < date(now()) - interval 26 year
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thank you for the answer. For the age, a person who is 25 and a half years old should still be matched by a query that searches for 20-25 year olds, for example. So I will not be searching _strictly_ from today's date by subtracting 25 years, but wish to match those who are 25 but not yet 26. I hope that makes sense and I am sorry if I was not clear about that. Do you know how to accomplish that? That's also why I did not go with the interval approach. The function I use is just one that I found somewhere, knowing that the conversion is not good practice. I will update it like you said. – ba0708 Jul 15 '13 at 12:18
  • 1
    Computing what their DOB would have to be to meet the criteria rather than trying to compute their current age is definitely the way to go. – Damien_The_Unbeliever Jul 15 '13 at 13:20
2

It's not really about the performance of UDFs or stored procedures. Whenever you use a function around a column MySQL cannot use indexes on it.

If you don't expect Highlander to be in your database a tinyint unsigned column for age will be enough (0-255). This costs 1 byte / row. You can put an index on it. The overhead what this column adds to the table is insignificant. Don't be afraid of storage space. Storage performance on the other hand is much bigger concern. A full scan search costs much more than this 1 byte extra column.

You can update this column with a trigger on the date_of_birth column. And of course if you put proper indexes on the table the nightly cronjob can efficiently select the row where date_of_birth = DATE(NOW()) and increase the age by one. (I would do this with a stored procedure so everything could be done in MySQL).

ps.: The function you wrote seems to be a stored function rather not a UDF. Stored function are written in SQL and stored in MySQL. UDF are written in C compiled to .so or .dll files and loaded to MySQL. For more info you can check: Help with SP and UDF?

Community
  • 1
  • 1
Károly Nagy
  • 1,734
  • 10
  • 13
  • Thank you. I totally agree with you. However, my concern was not the additional byte, but more to avoid "maintaining" the ages. While it would be relatively simple, cronjobs, triggers and such do complicate my application as a whole a bit more. That's not a bad thing if it is necessary (it would indeed give me great performance), but I would be happier to go for such a solution if the comparison cannot be done efficiently in the search query. Otherwise, I would rather handle everything there for the sake of simplicity. :-) – ba0708 Jul 15 '13 at 11:52