2

I have a function like this:

function query($query)
{
    /* If query result is in cache already, pull it */
    if $cache->has(md5($query))
       return $cache->get(md5($query));

    /* Do the actual query here. mysqli_query() etc, whatever u imagine. */

    /* Set the cache with 10 minute expiration */
    $cache->set(md5($query), $queryOutput, array(10)); 
}

So basically, if I query SELECT * FROM USERS, it is cached automatically for 10 minutes.

I don't know if md5 is safe to rely on. For once, it creates 32 character string, which sounds a bit overkill. Second, md5 is known to give same character string as output on certain inputs. Is there any alternatives to md5 to identify an unique cache key? There is very little chance, that two completely different SQL queries may get the same md5 output and break some pages of the website, but it is still a chance I should predict right now and code accordingly.

One more thing is, I feel like such use of functions is considered bad practise. A new user may be inserted to USERS table after my cache, but SELECT * FROM USERS will still get the same md5 output, hence, ignore the newly inserted user. They may register with same nickname few times in 10 minute duration.

Should I pass a second parameter to my sensitive queries, like query($query, 'IGNORECACHE')? It doesn't sound logical to me. There will be too much things to keep in mind. How do you guys handle such issues?

I will appreciate it if you can reply my first question about md5 alternative for this case, and a short explanation of good use of SQL caching on my second question would be greatly appreciated.

Thank you.

Lisa Miskovsky
  • 896
  • 2
  • 9
  • 18
  • 2
    md5 is collidable, but it's not a such a trivial collision that `md5('simple string') == md5('other simple string')`. the odds that given a hash for one VALID sql string would collide with some other valid sql string are essentially impossible, especially if you're not using blobs/random-ish binary data in the queries. – Marc B Mar 12 '13 at 14:20
  • Why don't you use the [MySQL query cache](http://dev.mysql.com/doc/refman/5.5/en/query-cache.html)? There should be no need to implement it on your own. – Fabian Schmengler Mar 12 '13 at 14:21
  • @MarcB: Do you have any sources for that? Like a md5 test with millions of different strings? E_ERROR: It may be ok, but there is no better alternatives? fab: I use memcached. – Lisa Miskovsky Mar 12 '13 at 14:26
  • 1
    @LisaMiskovsky: md5 was **THE** go-to hash for the longest time. if it was trivially collidable, it would never have been used as widely as it was, especially for things like signing ssl certs and whatnot. if you're worried about collisions, then you shouldn't be using a hash in the first place. you can always implement buckets so that even if there WAS a collision, you only have to search through a small bucket for matches rather than the entire possible query string space. – Marc B Mar 12 '13 at 14:30
  • … but a cache might hash to begin with, so just use the query as the key and save yourself the trouble :) – Ry- Mar 12 '13 at 14:33
  • MD5 is generally accepted as a no-go for security situations but is still fine as a non-secure hashing algo. cf, "...to get a collision, on average, you'll need to hash 6 billion files per second for 100 years." http://stackoverflow.com/a/288519/463935 – James Alday Dec 04 '13 at 17:18

2 Answers2

4

If you are worried that you get a hash collision (i.e. two queries with the same md5), simply use the query itself as a key:

if $cache->has($query)
    return $cache->get($query);
$cache->set($query, $queryOutput, array(10)); 

Alternatively, you can use sha1. It returns a longer string so the chance for collisions is lower.

Don't worry about storing 32 or 40 bytes as a cache key, this won't noticably influence the performance of your web application.

MySQL also has its own query cache. If you do the same query again, MySQL will get it from its cache. If you insert a user into the users table, MySQL will recognize that it can no longer use the cache, but this is not the case with your cache class.

Sjoerd
  • 74,049
  • 16
  • 131
  • 175
  • What about blank spaces and symbols like ( or ? in our queries. Do I need to replace spaces with _ for example? I kinda want to use Memcached for caching everything. How would you do this if MySQL didn't had such feature? Are there other alternatives than passing a secondary parameter? – Lisa Miskovsky Mar 12 '13 at 14:33
  • Yes, memcached has limitations on the format of the key, so you can not use the query directly as key. I would not cache anything until it is clear that I have a performance problem, and then I would cache specific parts that are slow, not all queries with their results. – Sjoerd Mar 12 '13 at 14:38
  • I don't have a performance problems yet, but I don't want to wait the day when my MySQL server is destroyed with requests. I can easily make a basic caching solution. It's mostly for practise purposes. – Lisa Miskovsky Mar 12 '13 at 14:41
2

You can somewhat safely use md5, as the odds for colliding are very low. Another solution that could fit your needs is naming the cache key by your query type and appending parameters with appropriate separators:

$result = $cache->get('userdata_'.$userId);

When used carelessly, the cache can give you lots of unexpected problems when data in the database changes but the old data remains in cache. Make sure you have a way to invalidate caches and do so for all related caches when you add/modify/delete data.

Kaivosukeltaja
  • 15,541
  • 4
  • 40
  • 70
  • 1
    "There are only two hard things in Computer Science: cache invalidation and naming things." -- Phil Karlton – Sjoerd Mar 12 '13 at 14:40