
I have a web app which is pretty CPU-intensive (it's basically a collection of dictionaries, but they are not just simple dictionaries, they do a lot of stuff; anyway, that part is not important). A CPU-intensive web app has the classic scaling problem: too many simultaneous users and you get pretty slow responses.

The flow of my app is this:

js -> ajax call -> php -> vb6 dll -> vb6 code queries the dictionaries and does CPU-intensive stuff -> reply to php -> reply to js -> html div gets updated with the new content. This all runs in a Windows environment with IIS 7.5. PHP acts just as a way of accessing the .dll and does nothing else.

The content returned/displayed is HTML-formatted text. The app has many PHP files which call different functions in the .dll.

So in order to avoid calling the vb6 dll for each request, which is the CPU-intensive part, I'm thinking of doing this:

Example ajax request:

php file: displayconjugationofword.php
parameter: word=lol&tense=2&voice=active

So when a user makes the above request to displayconjugationofword.php, I call the vb6 dll, and just before sending the reply back to the client, I add the request data to a MySQL table like this:

filename, request, content
displayconjugationofword.php, word=blahblah&tense=2&voice=active, blahblahblah

So the next time a user makes the EXACT same ajax request, the displayconjugationofword.php code, instead of calling the vb6 dll, first checks the MySQL table to see if the request exists there, and if it does, fetches the reply from there.
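In PHP it would look something like this (just a sketch of the idea; the table name request_cache, the PDO connection in $pdo and the call_vb6_dll() helper are placeholders for illustration, not my actual code):

    <?php
    // Sketch only: $pdo is an existing PDO connection; call_vb6_dll()
    // stands in for however the vb6 COM object is actually invoked.
    $filename = 'displayconjugationofword.php';
    $request  = $_SERVER['QUERY_STRING']; // e.g. "word=lol&tense=2&voice=active"

    // 1. Check the cache table first (needs an index on (filename, request)).
    $stmt = $pdo->prepare(
        'SELECT content FROM request_cache WHERE filename = ? AND request = ?'
    );
    $stmt->execute([$filename, $request]);
    $content = $stmt->fetchColumn();

    if ($content === false) {
        // 2. Cache miss: do the expensive vb6 call, then store the result.
        $content = call_vb6_dll($request);
        $ins = $pdo->prepare(
            'INSERT INTO request_cache (filename, request, content) VALUES (?, ?, ?)'
        );
        $ins->execute([$filename, $request, $content]);
    }

    echo $content; // 3. Reply to the ajax call either way.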

So this MySQL table will gradually grow in size, reaching maybe 3-4 million rows, and as it grows, the chance of a request already being in the db grows too, which in theory should be faster than doing the CPU-intensive calls (each anywhere from 50 to 750 ms long).

Do you think this is a good way of achieving what I want? Or will the lookups become slow too once the MySQL table reaches 3-4 million entries?

Thank you in advance for your input.

Edit:

I know about IIS output caching, but I think it's not useful in my case because:

1) AFAIK it only caches a .php file's output once it becomes "hot" (many requests).

2) I do have some .php files which call the vb6 dll but whose reply is random each time.

MirrorMirror
  • How about storing the HTML as flatfile and omitting the db? – Gerben Jacobs Aug 30 '13 at 08:50
  • @GerbenJacobs Impossible; there are millions of possible request combinations. – MirrorMirror Aug 30 '13 at 12:30
  • Have you considered using something like `memcached`? You can specify the expire time for each cache entry, and it will also remove old entries when its memory fills up. – Barmar Sep 02 '13 at 07:00
  • Can you predict what types of queries your users will make and build up your cache quickly rather than waiting for those users to make those queries? – Dmitri Zaitsev Sep 04 '13 at 14:07

3 Answers


I'm not an expert, but this is an interesting logic problem. Hopefully what I've set out below will help, or at least stimulate comments that make it useful.

To an extent, the answer is going to depend on how many queries you are likely to have, how many at once, and whether the MySQL index lookups will be faster than your definitive solution (the vb6 dll).

A few thoughts then:

It would be easy to pass caching requests on to another server, which would allow essentially infinite scaling.

Humans being as they are, most word requests are likely to involve only a few thousand words, so you will probably find that most of the work being done is repeat work fairly soon. It makes sense, then, to create an indexable database.

Hashing has in the past been suggested as a good way to speed up the indexing of data. Whether this is useful will depend to an extent on the length of your answers.

If you are very clever, you could determine the top 10000 or so likely questions and responses and store them in a separate table for even faster responses. (Guru to comment?)

Does your dll already do caching of requests? If so then any further work will probably slow your service down.

This solution is amenable to simple testing: use JS or PHP to generate multiple requests and compare response speeds with and without caching. Whichever you decide, I reckon you should test it with a large amount of sample data.
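A throwaway PHP script along these lines would do for the timing side (a sketch; the URL and sample words are made up):

    <?php
    // Rough benchmark sketch: hit the endpoint repeatedly and time it.
    // Run once with the cache enabled and once without, then compare.
    $words = ['lol', 'run', 'be', 'have', 'go']; // made-up sample words
    $start = microtime(true);
    $n = 0;

    foreach ($words as $word) {
        for ($tense = 1; $tense <= 5; $tense++) {
            $url = 'http://localhost/displayconjugationofword.php?word='
                 . urlencode($word) . '&tense=' . $tense . '&voice=active';
            file_get_contents($url); // fire the request, discard the body
            $n++;
        }
    }

    $elapsed = microtime(true) - $start;
    printf("%d requests in %.2fs (%.0f ms avg)\n", $n, $elapsed, 1000 * $elapsed / $n);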

Robert Seddon-Smith
  • thanks for your reply Robert. By "pass caching requests on to another server easily" do you mean the kind of caching that I suggested, or something else? Also, I'm not sure I understood "Does your dll already do caching of requests?": the dll gets called each time by php and immediately after that "dies". I don't think it's possible for the dll to somehow "remember" requests. – MirrorMirror Aug 31 '13 at 10:29

I love these situations/puzzles! Here are the questions that I'd ask first, to determine what options are viable:

  1. Do you have any idea/sense of how many of these queries are going to be repeated in a given hour, day, or week? Because... the 'more common caching technique' (i.e. the technique I've seen and/or read about the most) is to use something like APC or, for scalability, something like Memcache. What I've seen, though, is that these are usually used for caches shorter than about 12 hours. That's just what I've seen. Benefit: auto-cleanup of unused items.

  2. Can you give an estimate of how long a single 'task' might take? Because... this will let you know if/when the cache becomes unproductive - that is, when the caching mechanism is slower than the task.

Here's what I'd propose as a solution: do it all from PHP (no surprise). In your work-flow, this would sit at both PHP points: js -> ajax call -> php -> vb6 dll -> vb6 code queries the dictionaries and does CPU-intensive stuff -> reply to php -> reply to js -> html div...

Something like this (a full PHP sketch follows after step 3.4):

  1. Create a table with columns: __id, key, output, count, modified

    1.1. The column '__id' is the auto-increment column (e.g. INT(11) AUTO_INCREMENT) and thus is also the PRIMARY KEY

    1.2 The column 'modified' is created like this in MySQL: modified TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP

    1.3 'key' = CHAR(32), which is the string length of MD5 hashes. 'key' also has a UNIQUE INDEX (very important!! for 3.3 below). Note that KEY is a reserved word in MySQL, so the column name needs backticks in queries.

    1.4 'output' = TEXT, since the VB6 output will be more than a little text

    1.5 'count' = INT(8) or so

  2. Hash the query-string ("word=blahblah&tense=2&voice=active"). I'm thinking something like: $key = md5(var_export($_GET, TRUE)); Basically, hash whatever will give a unique output. Deducing from the example given, it might be best to lowercase the 'word' if case doesn't matter.

  3. Run a conditional on the results of a SELECT for the key. In pseudo-code:

    3.1. $result = SELECT output, count FROM my_cache_table_name WHERE `key` = "$key"

    3.2. if (empty($result)) {
                $output = result of running VB6 task
                $count = 1
          } else {
                $output = $result['output']
                $count = $result['count'] + 1
          }

    3.3. run query 'INSERT INTO my_cache_table_name (`key`, output, count) VALUES ($key, $output, $count) ON DUPLICATE KEY UPDATE count = $count'

    3.4. return $output as "reply to js"
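Pulled together, a minimal PHP sketch of steps 1-3 might look like the following ($pdo, the table name, and the run_vb6_task() wrapper are assumptions, not tested code):

    <?php
    // Sketch of the cache flow above. $pdo is an existing PDO connection;
    // run_vb6_task() is a hypothetical wrapper around the dll call.

    /* Step 1, once (backticks because `key` is a reserved word in MySQL):
    CREATE TABLE my_cache_table_name (
        __id     INT(11) NOT NULL AUTO_INCREMENT,
        `key`    CHAR(32) NOT NULL,
        output   TEXT,
        count    INT(8) NOT NULL DEFAULT 1,
        modified TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
        PRIMARY KEY (__id),
        UNIQUE KEY key_unique (`key`)
    );
    */

    // Step 2: one short, fixed-length key per unique request.
    $key = md5(var_export($_GET, true));

    // Step 3.1: look for a cached answer.
    $stmt = $pdo->prepare('SELECT output, count FROM my_cache_table_name WHERE `key` = ?');
    $stmt->execute([$key]);
    $result = $stmt->fetch(PDO::FETCH_ASSOC);

    // Step 3.2: miss -> run the expensive task; hit -> reuse the stored output.
    if ($result === false) {
        $output = run_vb6_task($_GET); // hypothetical dll wrapper
        $count  = 1;
    } else {
        $output = $result['output'];
        $count  = $result['count'] + 1;
    }

    // Step 3.3: insert, or just bump the counter when the key already exists.
    $upsert = $pdo->prepare(
        'INSERT INTO my_cache_table_name (`key`, output, count)
         VALUES (?, ?, ?)
         ON DUPLICATE KEY UPDATE count = VALUES(count)'
    );
    $upsert->execute([$key, $output, $count]);

    // Step 3.4: reply to js.
    echo $output;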

Long-term, you will not only have a cache but you will also know what queries are being run the least and can prune them if needed. Personally, I don't think such a query will ever be all that time-consuming. And there are certainly things you might do to optimize the cache/querying (that's beyond me).

So what I'm not stating directly is this: the above will work (and is pretty much what you suggested). By adding a 'count' column, you will be able to see what queries are done a lot and/or a little and can come back and prune if/as needed.

If you want to see how long queries are taking, you might create another table that holds 'key', 'duration', and 'modified' (like above). Before 3.1 and before 3.3, get the microtime(). If it's a cache hit, subtract the two microtime()s and store the difference in this new table with 'key' = $key and 'duration' = 2nd microtime() - 1st microtime(). Then you can come back later, sort by 'modified DESC', and see how long queries are taking. If you have a TON of data and the latest 'duration' values are still not bad, you can pull this whole duration-recording mechanism. Or, if bored, only store the duration when $key ends in a letter (just to cut down on its load on the server).
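A sketch of that timing idea, meant to slot into the sketch above (the cache_durations table is hypothetical):

    <?php
    // Optional timing sketch. Hypothetical table:
    //   CREATE TABLE cache_durations (
    //       `key`    CHAR(32),
    //       duration DOUBLE,
    //       modified TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
    //   );

    $t0 = microtime(true); // before step 3.1

    // ... steps 3.1 and 3.2 from the sketch above ...

    $t1 = microtime(true); // before step 3.3

    if ($result !== false) { // only record cache hits
        $log = $pdo->prepare('INSERT INTO cache_durations (`key`, duration) VALUES (?, ?)');
        $log->execute([$key, $t1 - $t0]);
    }

    // ... steps 3.3 and 3.4 as before ...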

David M
  • David, thank you for your long and detailed reply. May I ask about step 2, is there any particular reason for hashing the query string? – MirrorMirror Sep 02 '13 at 14:30
  • My thinking was/is two-fold. 1, it keeps you from having a "key" field that is huge in order to accommodate any query-string. 2, MD5 is pretty fast so as not to introduce very much delay at all... makes for ideal "keys" in both the creation of the key as well as searching for that key. Does that make sense as to why I suggested an MD5 hash? (sorry, forgot to explain that before) – David M Sep 02 '13 at 17:56
  • Yes, smaller length key is better, I was just worried about potential rare collisions – MirrorMirror Sep 02 '13 at 20:48
  • Aye - there are collisions. To that end, you could go 40 characters long and use sha1() instead. Still pretty fast, but a larger key-space means a collision is less likely. For what it's worth, someone did some simple speed-tests in 2010 here: http://stackoverflow.com/questions/2722943/is-calculating-an-md5-hash-less-cpu-intensive-than-sha-1-or-sha-2 Then again, unless you're doing 30,000 hashes/second, it might not matter :-) – David M Sep 02 '13 at 22:26

In order to get maximum performance in your case, you need to follow a basic cache-optimization principle.

I'm not sure if your application logic allows it, but if it does, it will give you a huge benefit: you need to distinguish requests which can be cached (static) from those which return dynamic (random) responses. Use some file-naming rule or provide a custom HTTP header or request parameter, i.e. any part of the request which can be used to judge whether to cache it or not.

Speeding up static requests. The idea is to process incoming requests and send back the reply as early as possible (ideally even before the web server comes into play). I suggest you use output caching, since it will do what you intend to do in PHP & MySQL internally, in a much more performant way. Some options are:

  1. Use the IIS output caching feature (a quick search shows it can cache queries based on the requested file name and query string).
  2. Place a caching layer in front of the web server. Varnish (https://www.varnish-cache.org/) is a flexible and powerful open-source tool, and you can configure the caching strategy optimally depending on the size of your data (memory vs. disk, how much memory can be used, etc.).

Speeding up dynamic requests. If they are completely random internally (no dll calls which can be cached), then there's not much to do. If there are some dll calls which can be cached, do it like you described: fetch the data from the cache; if it's there, you're good; if not, fetch it from the dll and save it to the cache.

But use something more suitable for the task of caching: key/value stores like Redis or memcached are good. They are blazingly fast. Redis may be the better option since its data can be persisted to disk (while memcached drops the entire cache on restart, so it needs to be refilled).
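With PHP's Memcached extension, for example, the whole cache layer shrinks to a few lines (a sketch; the server address, the 12-hour TTL and the run_vb6_task() wrapper are assumptions):

    <?php
    // Sketch: memcached-backed cache, assuming the Memcached PECL extension.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211); // assumed local memcached server

    $key    = md5(var_export($_GET, true));
    $output = $mc->get($key);

    if ($output === false) {
        // Miss: run the expensive dll call (hypothetical wrapper), then cache it.
        $output = run_vb6_task($_GET);
        $mc->set($key, $output, 12 * 3600); // expire after 12 hours, say
    }

    echo $output;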

Max Ivanov
  • When using a Caching Reverse Proxy like varnish, keep in mind to use cache friendly (idempotent) HTTP methods. So use GET if you're not changing server-state. – Chris Wesseling Sep 02 '13 at 20:34
  • I thought about IIS output caching but it's no good, it only caches "hot" or frequently requested queries, which means a request which is not popular will not be cached despite the fact that it may be slow. I prefer to cache ALL requests. Varnish seems a good option, but I'm afraid it doesn't run on Windows – MirrorMirror Sep 02 '13 at 20:41
  • Also from the Varnish site: "Varnish Cache will typically look up a response by inspecting the HTTP Host header together with the URL. Varnish Cache maintains an index, a hash, with all the host+url combinations that are kept in the cache." What I understand from this is that its cache is PER host, not universal. I would like a request done by user X to be served from cache for user Y. – MirrorMirror Sep 02 '13 at 20:42
  • The HTTP Host in that documentation means the host of the requested resource, not the origin of the request. It is needed when one Varnish instance backs one or more web servers serving many hosts. – Max Ivanov Sep 02 '13 at 21:30
  • As for installing varnish on Windows, check it: https://www.varnish-cache.org/trac/wiki/Installation#Windows – Max Ivanov Sep 02 '13 at 21:33