
Question modified slightly to improve understandability

My goal is to optimize a web application that has a very bad DB design and for which I can't touch the DB (I can't alter tables or introduce a new DB). I can operate on the code itself, on the filesystem, or via a proxy. By "optimizing" I mean: reduce the requests sent to the webapp (as opposed to those served directly from the filesystem), keep DB queries to a minimum, and reduce the number of distinct URLs called (with caching in mind).

Let me try to construct a fictitious example, just to provide something to talk about. Let us imagine this scenario:

  • I have a PHP webapp which exposes a database of a million different people. Each person decided at some point whether they are happy or sad.
  • when I visit person.php?id=x {x=1,..1000000}, the page creates a link to show_picture_of_person.php?id=x. show_picture_of_person.php goes into the million-row DB, which tells me whether the person is sad or happy by returning an image. I don't know what this image is unless I extract it from the DB; once extracted, I can analyze it and determine whether it's a sad face or a happy face. show_picture_of_person.php actually outputs an image, and the DB stores the image itself in a blob. Images are always either sad.jpg or happy.jpg.

What I would like to have, instead of a million links to show_picture_of_person.php?id=x, is two links: one for sad.jpg and one for happy.jpg. Possible solutions in my mind:

  1. I write a script which calls all the possible combinations of show_picture_of_person.php, saves all the images, works out which ones are equal, and then writes a lookup table. I put that lookup table in a PHP function make_sensible_url("show_picture_of_person.php?id=x") -> happy.jpg, which is called from the person.php script. I am worried about the performance of the PHP engine itself here (an array like that would by itself be a 50+ MB file!)
  2. Same as above, but instead of constructing an array in PHP, I create a filesystem of a million text files, and inside each text file I put the name of the actual static image file (avoiding repetitions). The function make_sensible_url("show_picture_of_person.php?id=x") simply reads and returns the content of that file (see the sketch after this list). I like this approach as there are no DB calls, and reading from the filesystem should be faster than the huge PHP array in solution 1.
  3. I change person.php so that there is no link to show_picture_of_person.php and instead I embed the images as data: URIs. The issue with this is that, if I have x calls to person.php, I still have 2x calls to the DB (one for person.php itself and one to fetch the image that show_picture_of_person.php would have returned). I would like to have only x calls to the DB. It also increases the size of the page (in my real case I have ~20 images per page, so lots of queries and lots of bytes)
  4. don't know what else..
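
For reference, here is a minimal sketch of what the lookup function in solution 2 could look like. The directory layout under lookup/, the /static/ prefix and the fallback behaviour are assumptions for illustration only:

function make_sensible_url(string $bad_url): string {
    // e.g. "show_picture_of_person.php?id=42" -> "lookup/a1/b2/a1b2c3....txt"
    $hash = md5($bad_url);
    $path = __DIR__ . '/lookup/' . substr($hash, 0, 2) . '/'
                                 . substr($hash, 2, 2) . '/' . $hash . '.txt';

    if (is_readable($path)) {
        // The tiny file contains only the static filename, e.g. "happy.jpg"
        return '/static/' . trim(file_get_contents($path));
    }
    // No mapping yet: fall back to the original dynamic URL
    return $bad_url;
}

// In person.php:
// echo '<img src="' . make_sensible_url("show_picture_of_person.php?id=" . $id) . '">';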

How would you solve this? Thanks!


For completeness' sake, here was the original question:

This is the scenario:

  • a database with various tables whose data is not properly indexed (let's say, for argument's sake, that we have 5000 unique objects represented in around 50,000 rows - so duplicates are present)
  • we are in a situation in which the database is non-modifiable (this also means that we can't add another table)
  • we have a PHP app exposing these objects
  • there exist around 1 million PHP calls (all legitimate), each of which returns one of those 5000 objects (e.g.: bad_url.php?id=bar, bad_url.php?id=foo, ..)
  • there is no easy way to programmatically decide which of the 5000 objects will be returned

Our goal is to somehow convert the million+ calls into calls of the form giveme.php?id=x, where x is one of the 5000 unique objects.

Just to give you an idea of a theoretical approach:

  • we could index all the million calls and map each to the distinct object it returns
  • we could create a hash table or something similar and a PHP function which would work as give_me_nice_url("bad_url.php?....").
  • my feeling is that such a solution would result in a 50-100 MB array .. not sure how performant it would be running in real time under load.

My question is: which approach would you use to solve this issue and handle the large data set? Does there exist a better way than a lookup table like in my solution? Remember, I can't use a database in the final production setup.

JoeSlav
  • "there is no easy way to programmatically decide which of the 5000 objects will be returned". How is that possible? Wouldn't each of the million PHP calls be programmatically deciding which objects to return? – Matt S Sep 08 '17 at 15:25
  • Or maybe I'm not understanding the question. What are "1 million php calls"? Do you mean a million possible URLs? You're trying to consolidate them somehow? – Matt S Sep 08 '17 at 15:26
  • First things first. Get rid of the duplicate rows: https://stackoverflow.com/questions/2594829/finding-duplicate-values-in-a-sql-table and then add some primary keys. Can you explain a bit more what you mean by "1 million php calls" and "bad_url.php?id=foo"? – waterloomatt Sep 08 '17 at 15:29
  • @waterloomatt: as I wrote above, I can't touch the DB and can't use a DB – JoeSlav Sep 09 '17 at 17:17
  • @Matt S: yes, exactly, I'm trying to consolidate URLs. Adding more details above... – JoeSlav Sep 09 '17 at 17:17
  • Thanks to all that will chime in! I've added an example which hopefully clarifies – JoeSlav Sep 09 '17 at 17:46
  • I don't really get the scenario: you have a database that you don't want / can't use because you can't change it. However, your problem seems very much *made* for a database (indexing mostly), which databases are really good at. Is it impossible to introduce a new database that actually is indexed and holds only the most minimal information? Also, is the problem with the image really that binary, i.e. either person x is happy or unhappy (1 or 0), or are there actual images unique to each person? If the former, one file with `n` bits (each 1 or 0) and some clever concept might help... – Jakumi Sep 10 '17 at 22:39
  • @Jakumi: images are not unique, there are repetitions (a lot of them). I can't introduce a new DB. I can either modify the PHP code or the filesystem. I've found another solution which I will report above. Thanks! – JoeSlav Sep 11 '17 at 06:23

4 Answers


Your option of having a file for each record would be easy, but also very bloated. If you know the image is one of 2 images, you can reduce your overhead by a large factor. Assume you use a file of 8-bit bytes: each of those 8 bits can represent either happy (set to 1) or sad (set to 0). That gives you a reduction of 8 already, so for a million entries your file will be at most about 125 KB.

It's then a case of reading and writing individual bits to the file. A couple of small functions to do this and some test code...

<?php
error_reporting ( E_ALL );
ini_set ( 'display_errors', 1 );

function getState ( int $user )  {
    $fp = fopen ( "users.dat", "c+" );
    // Find the byte in the file (position is user >> 3,
    // which is effectively user / 8)
    fseek($fp, $user >> 3, SEEK_SET );
    // Read the single char from the file
    $flagByte = fread($fp,1);
    // Extract the bit needed
    //  ord() converts a char to an ascii value ( 0-255)
    //  If the byte hasn't been read - use PHP 7 ?? to set it to a 0
    //  $user & 7 gets the bit position and shifts this bit to position 0
    //  & 1 extracts just this bit
    $flag = (ord($flagByte[0]??chr(0)) >> ($user & 7 )) & 1;
    fclose($fp);
    return $flag;
}

function setState ( int $user, bool $status )  {
    $fp = fopen ( "users.dat", "c+" );
    fseek($fp, $user >> 3, SEEK_SET );
    // Fetch the existing data
    $flagByte = fread($fp,1);
    $flagByte = ord($flagByte[0]??chr(0));
    // Get position of flag
    $flag = 1 << ($user & 7 );
    // Either set or unset the appropriate bit
    if ( $status )  {
        $flagByte |= $flag;
    }
    else    {
        $flagByte &= ~$flag;
    }
    fseek($fp, $user >> 3, SEEK_SET );
    fwrite($fp, chr($flagByte));
    fclose($fp);
}

setState(1, false);
setState(2, true);
setState(3, true);
setState(4, false);
setState(71, true);
setState(600100, false);
setState(600102, true);

echo "User: 1:".getState(1).PHP_EOL;
echo "User: 71:".getState(71).PHP_EOL;
echo "User: 600100:".getState(600100).PHP_EOL;
echo "User: 3:".getState(3).PHP_EOL;
echo "User: 600102:".getState(600102).PHP_EOL;
echo "User: 4:".getState(4).PHP_EOL;
echo "User: 871:".getState(871).PHP_EOL;
echo "User: 3:".getState(3).PHP_EOL;

I'm sure there are things you can improve in the code. In particular, if it were put in a class, you could open and close the file once rather than on each call (a rough sketch of that follows). But if only one record is going to be tested, it won't make much difference.
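
As a rough illustration of that point (not part of the original code), the read side could be wrapped like this so the handle is opened once and reused:

class StateFile {
    private $fp;

    public function __construct(string $file = "users.dat") {
        $this->fp = fopen($file, "c+");
    }

    public function getState(int $user): int {
        // Same bit layout as above: byte index is user >> 3, bit is user & 7
        fseek($this->fp, $user >> 3, SEEK_SET);
        $flagByte = fread($this->fp, 1);
        return (ord($flagByte[0] ?? chr(0)) >> ($user & 7)) & 1;
    }

    public function __destruct() {
        fclose($this->fp);
    }
}

// $states = new StateFile();
// echo $states->getState(71);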

Update: Assuming that you want to keep track of an image against a user, this method has a slower add cycle (as it checks whether the image is already there) but access is much more direct. The concept uses 2 files: one for the list of image names and one for the image associated with each user. The main thing is that when adding a new image, it checks whether the image is already there and, if so, returns the position of that image in the file; if not found, it just appends it at the EOF. All names are terminated by a PHP_EOL, so there is no limit on the image name length and no space is wasted by allocating fixed-size blocks. The user file just holds a pointer into this image file; for simplicity this is a 4-byte unsigned integer, so for a million users this is 4 MB - not that much.

function imageIndex ( string $addImage ): int {
    $images = fopen ( "images.dat", "c+" );
    while ( true )  {
        $pos = ftell($images);
        $image = fgets($images);
        if ( $image === false || rtrim($image, PHP_EOL) == $addImage )  {
            break;
        }
    }

    if ( $image === false ) {
        fseek($images, 0, SEEK_END);
        $pos = ftell($images);
        fwrite($images, $addImage.PHP_EOL);
    }
    fclose ( $images);
    return $pos;
}

function addUserImage ( int $userID, string $image )    {
    $users = fopen ( "users.dat", "c+" );
    // Fetch image location
    $image = imageIndex($image);
    // Locate user indicator (4 bytes per user)
    $loc = $userID << 2;
    fseek($users, $loc);
    // Write the location as an unsigned integer (4 bytes)
    fwrite($users, pack("L", $image));
    fclose ( $users);
}

function fetchUserImage ( int $userID ): string {
    $users = fopen ( "users.dat", "c+" );
    $images = fopen ( "images.dat", "c+" );
    // Locate user indicator
    $loc = $userID << 2;
    fseek($users, $loc);
    $imgRef = fread($users,4);
    // Convert the 4 chars to a PHP integer
    $imgLoc = unpack("Lloc", $imgRef);
    fseek($images, $imgLoc["loc"]);
    $image = fgets($images);
    fclose ( $users);
    fclose ( $images);

    return rtrim($image,PHP_EOL);
}

// Create 4000 users with some image
// for ( $i=0; $i<2000; $i++ )    {
//     addUserImage($i,"Image{$i}.jpg");
// }
// for ( $i=0; $i<2000; $i++ )    {
//     $ino = 2000 - $i;
//     addUserImage($i+2000,"Image{$ino}.jpg");
// }

// Fetch the image for 2000 users
for ( $i=0; $i < 4000; $i+=2) {
    echo "User {$i} image=".fetchUserImage($i).PHP_EOL;
}
Nigel Ren
  • Thanks Nigel, I hear what you are saying and I like the fact that you also lean towards option 2 that I presented. In my real scenario I have around 10k unique images for over 1 mil URLs, so I think I would still use 1 file per URL, would you agree? Inside the file I will have the MD5 of the image, which is then stored on disk with its MD5 as the filename. Your solution, though, is fine for my "example" of 2 images, I agree. Thanks. – JoeSlav Sep 13 '17 at 08:48
  • I would strongly recommend against having a million or so files. This makes server management very difficult. Under no (normal) circumstances should you try and store this many files in a single directory. Is your data fairly static? If so, there is an alternative which is more change intensive, but read friendly. – Nigel Ren Sep 13 '17 at 09:47
  • The million files are not stored in a single directory: there won't be more than 1000-2000 files per directory (in my solution I formed a tree structure to achieve that, all automagically with a script). That being said, the data is static but will grow over time (that is another reason why I like the files approach: I can keep adding to that directory with just a copy-paste). I'm all ears for your alternative & comments! Thanks! – JoeSlav Sep 13 '17 at 09:50
  • Hey Nigel, thanks for updating the answer, which I think is completely relevant as per the "simplified problem" I outlined in my question. However, do you agree that your solution works only if I have an ID for a person which is an integer? In my case I don't have such an ID; I have 10 parameters which make up the ID, so the easiest way I found to make an ID out of them was to combine them and hash them. In this situation would you agree my solution is simpler, or would you still try to find a way to combine the "indexing" in a single file? Thanks! – JoeSlav Sep 15 '17 at 08:10
  • In other words, I would need 16 bytes to identify 1 person and 16 bytes to return its physical address, and I can't just (I think, maybe I'm wrong) fseek through a file .. I need to loop through it .. ? – JoeSlav Sep 15 '17 at 08:14
  • Sorry - not sure where the 16 bytes come from. The only reason you would need to loop through the image file is if you're adding a new image. When you're fetching the image for a person, you fseek by their id in the user.dat file and read the 4 bytes, then fseek in the image file using those 4 bytes as the offset. – Nigel Ren Sep 15 '17 at 08:41
  • In the real world I don't have an ID of a person that is 1...100000; I have IDs which are 16 bytes of data (an md5). Although I am not disputing that your solution is fine for the happy/sad example, do you agree it would translate to the real world with IDs of 16 bytes? – JoeSlav Sep 15 '17 at 08:54
  • I was going by the part of your question where it says - when I visit person.php?id=x {x=1,..1000000}. – Nigel Ren Sep 15 '17 at 08:57
  • Yes, agreed. Now, do you think the millions of small files would then be better than a big file which I need to loop through? Not sure if I can just fseek anymore – JoeSlav Sep 15 '17 at 09:04
  • Surely your md5 key is just a hash of the image - not the user ID. So this provides the basis for an fseek. Also - millions of small files is not a problem, but you'll probably find the block size is 4k, so each file will take up 4k of space even if it only has 2 bytes in it. Not that this is a major problem, but also as the number of files you have increases - so does the time taken to find it, add in layers of directories and this again has a delay. – Nigel Ren Sep 15 '17 at 14:26

Now that I understand what you want to do, I think there are two logical cases.

One case is that you really just want to cache the "happy/sad" logic. In this case, I'd still suggest a Varnish or similar solution.

In this case, the HTML your app outputs doesn't change (<img src=show_picture_of_person.php?id=x>, presumably), and it's cached at every level - from browser through to your reverse proxy.

Modify your PHP code to set a really long TTL for the images coming from show_picture_of_person.php?id=x - depending on the business logic (do people go from happy to sad? do you need to invalidate the cache?), this could be hours, days, weeks or decades.

Then create a rule in Varnish to honour that TTL when handling requests for the images.

This way, the request hits your PHP code only when the item cannot be found in the cache.
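
As a sketch of the PHP side only (the one-year TTL is an arbitrary placeholder and the Varnish VCL itself is omitted), show_picture_of_person.php could emit something like:

// Before streaming the blob from the DB
$ttl = 31536000; // one year - pick whatever matches your business logic
header('Content-Type: image/jpeg');
header('Cache-Control: public, max-age=' . $ttl);
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $ttl) . ' GMT');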

The other option is that you want to move the logic "is person x happy or sad" away from retrieving an image through an expensive database call, and instead bake that into some other data storage mechanism, and your HTML code instead becomes <img src=happy.jpg>.

As you cannot store this in the database, you're looking at other options - effectively storing this in the filesystem, either as one huge array, or as lots of tiny files.

There are a couple of issues with this - the best place to store the data is in the database; separating this out into a different location creates maintenance and understandability challenges. One huge array doesn't really scale. Storing and accessing lots of tiny files is slow, and some filesystems limit the number of files in a directory, or slow down beyond a certain limit.

The biggest challenge with storing the data as files on the hard drive is that your application will become I/O bound from a scalability point of view. It's hard to say whether this will be a real problem - I'd run some fairly serious scalability testing using something like Apache JMeter.

My recommendation would be to use one of the many off-the-shelf PHP caching solutions, rather than bake your own. Which one depends on what you can install on your environment.

Neville Kuyt
  • Thanks for answering :) If I create a layer with a revproxy, Varnish, or even a simple .htaccess, it won't help me publish new URLs anyhow (apart from the questionable performance of handling a million different cases), as I don't know 'how' to create the new URLs. I'll try to add some more info in the question above – JoeSlav Sep 09 '17 at 17:19
  • Thanks! Opt#1 has already been implemented for years and it doesn't provide satisfactory results: the cache footprint is too large (I'm talking about caching outside of my DC, as I want to decrease incoming traffic as well as improve perf). Opt#2 is exactly what I want to achieve. I agree that a huge array doesn't scale, thus I opted for many tiny files. Given that I am not storing more than 2000 files per directory (I've made a tree-structured fs), would you agree this is fine in terms of performance? I agree that this is a convoluted solution, but I didn't find any simpler options..... Thanks! – JoeSlav Sep 15 '17 at 08:24
  • I've updated the answer. In general, I'd strongly recommend you use an off-the-shelf caching mechanism, rather than rolling your own. From a performance point of view, your approach is probably okay - but make sure you test it under load. – Neville Kuyt Sep 15 '17 at 08:37
  • Thanks. I am actually already caching the .php output of the pages (outside of the DC); that will remain as is. I am not actually building "my own caching system", that would be a misunderstanding: the main purpose of my system is to create static files for the images and use those instead of PHP calls. – JoeSlav Sep 15 '17 at 09:08

I would cache the results of show_picture_of_person.php?id=x to the file system, which is similar to your approach #2.

However, you may want to consider using a caching library instead of rolling your own. Some frameworks like Laravel come with caching options, or you could use a 3rd party caching library instead, such as https://github.com/tedious/Stash

Here's an example for stash:

// show_picture_of_person.php

if (!isset($_GET['image_path'])) {
    // Create Driver with default options
    $driver = new Stash\Driver\FileSystem(array());
    // Inject the driver into a new Pool object.
    $pool = new Stash\Pool($driver);
    // Get a cache item.
    $item = $pool->getItem('image_path_' . $_GET['id']);

    // Attempt to get the data
    $image_path = $item->get();

    // Check to see if the data was a miss.
    if($item->isMiss())
    {
        // Let other processes know that this one is rebuilding the data.
        $item->lock();

        // Run intensive code
        $image_path = codeThatTakesALongTime(); // save image to disk here

        if ($image_path) {
            // Store the expensive to generate data.
            $pool->save($item->set($image_path));
        }
    }
}
else {
    $image_path = $_GET['image_path'];
}

// Continue as normal.
useDataForStuff($image_path);

In person.php?id=x you can now check the cache above for key x: if it's populated, render show_picture_of_person.php?image_path=[sad/happy].jpg, and if it's not populated, render show_picture_of_person.php?id=x, which will populate the cache once clicked (a rough sketch of this follows).
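
A rough sketch of that check in person.php, reusing the same Stash pool and key naming as in the example above (the markup is simplified):

// person.php
$driver = new Stash\Driver\FileSystem(array());
$pool   = new Stash\Pool($driver);
$item   = $pool->getItem('image_path_' . $_GET['id']);
$image_path = $item->get();

if (!$item->isMiss()) {
    // Cache hit: link to the known static image, so the URL is one of only a few
    $src = 'show_picture_of_person.php?image_path=' . urlencode($image_path);
} else {
    // Cache miss: keep the original URL; the first request will populate the cache
    $src = 'show_picture_of_person.php?id=' . urlencode($_GET['id']);
}
echo '<img src="' . htmlspecialchars($src) . '">';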

FuzzyTree
  • Thank you for answering. If I understand correctly, you are saying to keep the show_picture_of_person.php?id=x links in my page and simply cache them on the FS to avoid queries to the DB .. am I right? – JoeSlav Sep 13 '17 at 07:54
  • @JoeSlav Correct but instead of only caching the image, you'll cache both the path and image. This will avoid caching duplicate image binaries – FuzzyTree Sep 13 '17 at 16:21
  • OK! Thanks. Just for completeness, this doesn't fix my main issue: all the million+ links. I want to avoid those, most of all! In other words, it's not enough to optimize with caching within my datacenter, I also want to optimize the internet traffic. Again, thanks – JoeSlav Sep 15 '17 at 08:03
  • @JoeSlav you can avoid duplicate urls by checking the cache that is created above in `person.php`. if there is a hit, then render a link where the argument is the `image_path` i.e. `show_picture_of_person.php?image_path=sad.jpg` otherwise the link will be `show_picture_of_person.php?id=x`. If only the id is passed, then the db will be queried and the cache populated – FuzzyTree Sep 15 '17 at 16:58
  • Thanks! I think this solution is exactly what I am doing in solution 2, with the difference of using an out-of-the-box library, plus your solution will create additional overhead with real-time requests the first time there is a miss (who cares). However, my question is: how does Stash handle data? (looking at documentation now) – JoeSlav Sep 15 '17 at 18:14
  • @JoeSlav see http://www.stashphp.com/Drivers.html - the `dirSplit` argument will probably be useful to you since you will have a large cache pool. One benefit of using a library like stash is that you can easily migrate away from the filesystem if you ever need to in the future – FuzzyTree Sep 15 '17 at 18:21
  • Thanks. I must say this is intriguing. The cons of this solution are: I will need to use a library, as opposed to one 10-line PHP function; and there is a security issue here: I could potentially pollute the caching system via calls with random parameters. But this is intriguing for not having to do "pre-work" .. thinking ... – JoeSlav Sep 15 '17 at 18:29
  • Polluting the cache shouldn't be an issue. If an invalid parameter is provided, then `show_picture_of_person.php` will not find a picture and will not populate the cache – FuzzyTree Sep 15 '17 at 18:36
  • Unfortunately, the way the legacy app works, it may (legitimately) return an empty image in case it doesn't find one, so I would never be able to distinguish whether a call is legit and came back empty-handed or non-legit and empty-handed. However, I realize my system will suffer from the same "disease". I'm positive I don't want to introduce a new library, but the run-time saving of images is something that is intriguing. Still thinking :) Of course, by doing "offline periodic" work I will have the chance of making sure this is not exploding with "index" files. – JoeSlav Sep 15 '17 at 18:59

In some structured query language (SQL) dialects there is the concept of a 'CURSOR', similar to returning 10 or 50 search results per page.

If you use this, you can query for a suitably (or even selectably) sized range of rows to show on a page, and you can also calculate arbitrary ranges of the same number of rows to present a list of links for browsing to the next or previous results.

Calculating the total number of such pages is more expensive.
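
For illustration, a common web equivalent of this is offset-based paging; the $pdo connection and the people table/columns below are made-up placeholders, not part of the original question:

$perPage = 50;
$page    = max(1, (int)($_GET['page'] ?? 1));
$offset  = ($page - 1) * $perPage;

// Fetch just one page of rows
$stmt = $pdo->prepare("SELECT id, name FROM people ORDER BY id LIMIT :lim OFFSET :off");
$stmt->bindValue(':lim', $perPage, PDO::PARAM_INT);
$stmt->bindValue(':off', $offset, PDO::PARAM_INT);
$stmt->execute();

foreach ($stmt as $row) {
    echo '<a href="person.php?id=' . (int)$row['id'] . '">'
       . htmlspecialchars($row['name']) . "</a><br>\n";
}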

bbaassssiiee