4

This is in relation to my post here but taken in a completely different direction Charset detection in PHP

essentially, i'm looking to reduce the memory that many huge arrays cause.

These arrays are just full of integers but seeing as PHP uses 32bit and 64 bit integers internally (depending which version you have compiled for your CPU type), it eats the memory.

is there a way to cheat PHP into using 8bit or 16bit integers?

I've thought about using pack(); to accomplish this so I can have an array of packed binary values and just unpack them as I need them (yes I know this would make it slower but is much faster than the alternative of loading and then running through each array individually as you can stream the text through so they all need to be in memory at the same time to keep speed up)

can you suggest any better alternatives to accomplish this? i know it's very hacky but I need to prevent huge memory surges.

Community
  • 1
  • 1
Jase
  • 599
  • 3
  • 9
  • 3
    This doesn't sound like a sensible strategy to reduce memory usage. What causes the surges in your scripts? Can you show some examples? Is it really necessary to load those huge arrays into memory all at once? What are you doing with them? – Pekka Mar 31 '11 at 19:22
  • 1
    Slightly helpful hint: Forget about PHP if you want to work with a huge set of numbers. Try out something more resource effective. (C/C++) – erenon Mar 31 '11 at 19:26
  • I agree with Pekka, PHP has pretty good memory management already, changing the bit length of your integers is not a good solution. Instead you should look to the actual cause, the huge arrays themselves. Maybe if you'd give more details on the operations you are performing, we could be of more assistance. – Unsigned Mar 31 '11 at 19:27
  • @erenon: True, but C/C++ is rarely a simple or viable option in the environments PHP is most often used (remote/web-server). – Unsigned Mar 31 '11 at 19:28
  • If you've had experience with chardet,the mozilla charset detection algorithm (mine is ported from the python version), you'll see that very large integer arrays are used. These arrays can easily go more than a few hundred indexes at a time. They are used for context analysis of each character set. I could test each character set seperately and unload each one when it's finished but this is slow. Plus, there is an option to feed the text through (useful for slight larger files) and in this instance, it's far quicker to have them in memory than keep reloading them each time text is streamed in – Jase Mar 31 '11 at 19:29
  • @erenon - Unfortunately, this is not an option. I'm restricted to PHP only – Jase Mar 31 '11 at 19:30
  • 4
    The real problem with having an array of integers in PHP is not the integer bit width, it's the arrays themselves. PHP doesn't have any. It has dictionaries. If you want to conserve memory, then use [SplFixedArray](http://php.net/splfixedarray) if you can. Else you will have to microoptimize: pack your array into a string, use four characters each to pack a hexstring. (You can only trade memory for speed in PHP). – mario Mar 31 '11 at 19:35
  • @mario Thank you for the suggestion. I did not know SplFixedArray is lower memory and that arrays would be the cause of the huge memory. I thought SplFixedArray was an extension of the standard array functionality. The hex string sounds like a very viable idea which I shall give a try. Thanks again. Anymore suggestions are more than welcome and I will try them all. Atm, loading all arrays uses 80mb of memory, so a slight speed drop will not be scorned :) – Jase Mar 31 '11 at 19:38
  • @mario: Post this comment as an answer. – erenon Mar 31 '11 at 19:42
  • I think it's actually feasible to implement a fake array which stores the data internally as such a string. Btw, one of the more interesting questions today. But maybe you will have to add a bounty to get a detailed answer.. (The lack of quick answers might be due to lack of examplary code to show.) – mario Mar 31 '11 at 19:43
  • @mario - That is an issue and my apologies for that. My code is currently at work which I can't access the moment. It is LGPL so it's not a licence restriction posting it. Can you please post as an answer so I can set it as an accepted solution. I've ran a quick test and it seems to reduce memory considerably with minimal performance hit (before it would retrieve the data by index number so a substr($context_array, $index*4, 4) is very efficient and effective) – Jase Mar 31 '11 at 19:50
  • Give me a minute, I'll give you an real answer.. – mario Mar 31 '11 at 19:52

1 Answers1

7

Don't tell nobody!

class IntegerstringArray IMPLEMENTS ArrayAccess {

    var $evil = "0000111122220000ffff";

                 // 16 bit each


    function offsetExists ( $offset ) {
        return (strlen($this->evil) / 4) - 1 >= $offset;
    }

    function offsetGet ( $offset ) {
        return hexdec(substr($this->evil, $offset * 4, 4));
    }

    function offsetSet ( $offset , $value ) {

        $hex = dechex($value);
        if ($fill = 4 - strlen($hex)) {
            $hex = str_repeat("0",  $fill) . $hex;
        }

        for ($i=0; $i<4; $i++) {
            $this->evil[$offset*4+$i] = $hex[$i];
        }
    }

    function offsetUnset ( $offset ) {
        assert(false);
    }

}

So you can pretty much create an array object from this:

$array = new IntegerstringArray();
$array[2] = 65535;
print $array[2];

It internally stores a list and accepts 16-bit integers. The array offsets must be consecutive.

Not tested. Just as an implementation guide.

Charles
  • 50,943
  • 13
  • 104
  • 142
mario
  • 144,265
  • 20
  • 237
  • 291
  • Haha, It's going into an open source library so unfortunately, i have to tell :P I will label it as a hack and give you due credit for the suggestion within the code. Thanks again, this has bugged me for 2 months :) – Jase Mar 31 '11 at 20:04
  • +1 for *evil*. Also fixed a typo for you discovered during testing. – Charles Mar 31 '11 at 20:10
  • @Charles: I would have purported that was an intentional typo to prevent people from actually using it. But alas, it was an ordinary typo :} Thanks! – mario Mar 31 '11 at 20:13
  • I imagine a version of this using `pack()` (as the OP suggested) would be even more memory efficient. You could still use the same sort of substring operations, and `unpack()` after extracting the relevant substring. But SplFixedArray, as suggested by mario, sounds like a much cleaner solution, in the end. – Frank Farmer Mar 31 '11 at 20:14
  • @Frank: That would be worth to benchmark for once. At the very least you could make it twice as memory efficient by using binary data instead of hex numbers. – mario Mar 31 '11 at 20:18
  • As a suggestion of my own which I just realised. If we imitate the fixed array functionality, we can set the internal string to a pre-determined length full of 0's. This way, the set wouldn't need to be consecutive. It doesn't matter in my current case but if someone else finds this useful,they may be able to add this in for themselves. – Jase Mar 31 '11 at 20:18
  • @mario, I'd say that as long as everyone's aware of the (minor) speed penalty of going through ArrayAccess, it then becomes a tradeoff between speed and what may or may no be wasted memory, at the cost of clever/evil. – Charles Mar 31 '11 at 20:18
  • @Jase: I don't know the interna of SplFixedArray. But I doubt it's more memory-efficient than this solution actually. Because this one is devoid of an index table, and does not associate the data via ZVALs. (SplFixedArray OTOH could hold strings or sub-arrays for example.) - You could pre-define the length of the array yourself, with str_repeat*4 on ->evil="". – mario Mar 31 '11 at 20:20
  • @Charles - The current code uses 80mb of RAM for a single page. I can filter this down to 30 by excluding processes that may be rare.. This is from a port of chardet to PHP (character set detection). The only performance hit would be the access to a particular index (as it doesn't loop through these arrays). Performance hit doesn't even matter as it's quite small compared to the amount of memory saved – Jase Mar 31 '11 at 20:23
  • @Charles. True. This is just for syntactic convenience. If it's worth to optimize for speed, then the raw methods and a central string would be optimal. Depends on how much @Jase had to rewrite for that.. – mario Mar 31 '11 at 20:24
  • @mario - I will run some benchmark's on the 300~ test files I have to see which is more efficient compared to memory and speed. But i can't do this right now because I don't have the files at hand. Either way, one or the other will be way better options than I had before – Jase Mar 31 '11 at 20:25
  • @Jase, this sounds *unexpectedly* sane then! Good luck. – Charles Mar 31 '11 at 20:29
  • 4
    I've just ran benchmarks on all 3 using 30,000 iterations. (realistic size for 16bit int, i know the max is around 32,000). array - the fastest but used 3.42mb splFixedArray - Very similar to array speed. 0.001 seconds between them but only used 1.69mb. (more than I was expecting. an iteration of 1 million used 50% the size of standard array) IntegerstringArray - by far the slowest. 10x slower than splFixedArry but uses a measley 0.43mb. This was on a 64bit build of PHP 5.3.3 on ubuntu 10.10. I think I'm going for the best of both worlds. User will be able to choose :) – Jase Mar 31 '11 at 21:34
  • as the above was iterating in a for loop (and setting the $i counter to the array), the speed includes the iteration speed. These arrays will be accessed by their index's rather than in an iteration so speed with be considerably faster in all methods :) – Jase Mar 31 '11 at 21:43
  • Wow, that's a pretty insane slowdown. – Charles Mar 31 '11 at 21:49
  • I wouldn't have expected that either. But OTOH it's fairly realistic. Transliterating C code into PHP makes it easily 10x slower. And for the simplistic substr() and string patching approach it's actually quite a good number :} – mario Mar 31 '11 at 21:54
  • 2
    Just ran another test. Using pack('v') instead of the hex conversion is the same speed and reduced the memory down even further to 0.38mb. I think i will supply 2 options to the user. Allow memory saving or allow performance. This will give the best of both worlds :) also, bare in mind, getting rather than setting was nearly twice as slow in all cases. 0.02 seconds in array and splfixedarray and 0.22 in integerstringArray (both pack and hex conversion). But this was also within a for loop so the loop time adds to it. Thanks again for all the help :D I can finally finish the project :D – Jase Mar 31 '11 at 22:01
  • As a comparison aswell. I just ran the loop empty to test base speed and base memory 0.33mb and 0.0028798580169678 seconds. This means the pack option only actually uses 0.05mb of memory :O whereas array actually uses 3.09mb. This hack has turned into pure genious @mario :D. ok, I'll stop annoying you with posts now. Just wanted to update on my findings. – Jase Mar 31 '11 at 22:13
  • @Jase: No. That's cool. Just don't you forget to also post a link to the finished project somewhen! If it's an exact port of a charset toolkit then I'd like to peek at that. – mario Mar 31 '11 at 22:19
  • The project will be called PHPChardet. At the moment, the code gives the exact same success rate as python chardet for detecting character encodings so it works but it just needs some PHP specialised treatment to hack it into shape :) It's currently on the svn servers at work but this will be moved to github when these mods are added and a few more performance ideas i have. I've also got to add all the licence info to give credit to the original authors of chardet and python chardet. I'm looking at another couple of weeks to get it finalised with last bits of testing too – Jase Mar 31 '11 at 22:32
  • My apologies for not posting this yet. It seems my work has dropped the project as they are now becoming more java and c# oriented for our websites (where ports of chardet already exist and are obviously much faster. They are changing for more reasons than this though). However, they have allowed me to keep it for myself. I'm integrating this into my class framework. A stupidly huge framework that does pretty much everything. This however means I have to work on it in my free time which I have very little of :( I will release this project though :) Just gonna take me a while lol – Jase May 15 '11 at 14:10
  • @Jase: How fast is `strlen()` in PHP? Does it store the length of the string or does it scan for a `\0` character? Would it help to save the length of the string in the class too, instead of calling `strlen()` in `offsetExists()`? – Jan Fabry Jun 14 '11 at 17:58
  • @JanFabry: Someone recently [declared `strlen` as a slow](http://www.xarg.org/2011/06/php-hacking/). Probing for string length with `isset($str[22225])` is a bit speedier. But at least PHP does *not* actually search for NUL bytes. The string length is available as php internal `zval` attribute for each variable/value. – mario Jun 14 '11 at 18:02