5

I am trying to implement a very, very large dictionary search to match words in a sentence in PHP. My initial idea was to use the Aho-corasick algorithm, as the Aho-corasick solves my exact problem. I first implemented a Trie in PHP. The Trie, when cached, makes a sufficiently fast dictionary; however, it takes up approximately 3mb of memory. This is not going to scale well in PHP.

Obviously, no matter the data structure we use, a large dictionary will take up a lot of memory. I only need a single instance of the dictionary, since it is static and will not need to be rebuilt.

If this object could be shared between all threads, 3mb of memory is negligible, however, I am unsure of the proper way to share memory between threads in PHP.

How can I share this object between HTTP requests? I cannot see the project scaling when each thread requires 3mb overhead created just by the Trie.

Anthony Massie
  • 139
  • 1
  • 9
  • 2
    PHP generally doesn't do threading. `Trie` is also probably not enlish. Did you mean `Tree`. 3MB is *not much* for the average PHP process. – Evert Oct 07 '15 at 17:33
  • 7
    Trie is a proper noun describing a particular data structure https://en.wikipedia.org/wiki/Trie – Eric J. Oct 07 '15 at 17:36
  • http://stackoverflow.com/a/209799/1888402 "[There is no threading.] The next best thing would be to simply have one script execute another via CLI, but that's a bit rudimentary. Depending on what you are trying to do and how complex it is, this may or may not be an option." - Wilco – ZeroBased_IX Oct 07 '15 at 17:37
  • `PHP generally doesn't do threading` I think he means to share memory between different HTTP requests (which are probably processed on separate threads, though that is transparent to PHP). – Eric J. Oct 07 '15 at 17:37
  • @Anthony: Is the data entirely read-only once the Trie is constructed? – Eric J. Oct 07 '15 at 17:38
  • While there are multithreaded versions of php, typically it is not multithreaded unless you have a specific install. You could cache the data structure using some sort of memory cache, but that will still likely load a copy of the entire structure for each php call into the threads memory. How often does the trie change? Perhaps just reading it into memory from a serialized format would be just fine? – Jonathan Kuhn Oct 07 '15 at 17:39
  • [APC](http://php.net/manual/en/ref.apc.php) ***might*** be what you are looking for but don't hold me to it. – MonkeyZeus Oct 07 '15 at 17:40
  • http://stackoverflow.com/questions/5605656/share-variables-memory-between-all-php-processes – Eric J. Oct 07 '15 at 17:40
  • Also, perhaps a fast memory store such as redis would be better if you need quick access to a data structure that is independent of the php process (and so will will be available over different reloads and requests). – Jonathan Kuhn Oct 07 '15 at 17:42
  • Anthony, any reason that a DB is not fast enough or a good enough solution overall? I am not intimately familiar with Trie but it sounds like a DB could work. You can index the heck out of it if the data is going to remain static. Heck, even look into storing it on a RAM-DISK – MonkeyZeus Oct 07 '15 at 17:45
  • @Evert why comment before you check the facts? PHP "does" threading and trie is a data structure named after re**trie**val. Anyway, on topic - you can use ZeroMQ and connect threads via channels. You can have a single thread that serves as a sort of local storage from which other threads pull. Since ZMQ stores its data in memory, it will act as an in-memory store. What you need to do is keep the 3MB data structure available for ZMQ to send it around. PHP data structures are kept in memory so you don't have to screw around too much. Alternatively, there's always shared memory and serialization – N.B. Oct 07 '15 at 18:07
  • 3
    How big is your dictionary? Why cant you search it in a DB? How much scaling do you require? Is every hit on your website going to require this datastructure? Are you guilty of pre mature optimisation? – Toby Allen Oct 07 '15 at 18:18
  • Sorry guys. I edited and clarified my question. My exact problem is aho-corasick. Words have to be matched in a sentence. A database cannot be used. Yes, I meant http requests, and I was hoping for some kind of memory cache that can be shared between these HTTP requests. From recent metrics, the page has peaked at about 200 requests a second. – Anthony Massie Oct 07 '15 at 19:56
  • @N.B. While you can do threads with PHP, you really shouldn't ;). More on point, OP was talking about different PHP processes, not multiple threads inside a single PHP process. – Evert Oct 07 '15 at 20:56
  • @Evert - an opinion, no matter how "awesome" sounding in your head is still opinion and not a fact. You probably shouldn't do a lot of things, not just in IT but in life, which you do. Still, an in-memory solutions using PHP is trivial. A small daemon that responds either to signals or socket (ZeroMQ sounds appealing again) can completely do this job, can be implemented quickly and requires no extreme programming knowledge (plus there are examples available how to do it, literally copy paste). If anyone's opinion is not to use PHP for daemons, please don't highlight me while stating that :) – N.B. Oct 07 '15 at 21:30
  • @N.B. I agree that I could have better phrased 'doesn't do' as 'it's ridiculous to'. In the real, practical world of PHP programming, PHP threading has never left the state of 'experimental'. But if you want to be pedantic about it: yes PHP can do threads. Well done. – Evert Oct 07 '15 at 22:28
  • @Evert your programming honour stands defended, as a true programming knight would defend it. I take it we're done here, right? :) – N.B. Oct 07 '15 at 23:26
  • Can you share more details about the basic concept? 3MB overhead per thread does not sound like a lot – Nico Haase Jan 28 '21 at 15:27

4 Answers4

6

I wrote (forked from APC and maintain) APCu: A shared memory cache is not going to help you. Their internal storage area already has a defined structure, you can't change it. You can store your structure as objects, but these, and no other value, are actually shared between instances of PHP. Shared memory apc-like caches, copy out of shared memory for each context that requests the value.

I wrote pthreads (PHP extension): Threads are not going to help you. Just like APC having to copy out of shared memory, threads must.

PHP is shared nothing, all the time, or else you break stuff. You could write code that seemed as if it were sharing memory, but it wouldn't be; The rules must never be broken.

I don't think PHP a sensible target language if a primary requirement is efficiency, you seemed to recognize this by the end of your first paragraph. I can be wrong about that, but armed with all the facts above, I'd be surprised if you don't agree.

While it's not a sensible language, it's an arguably sensible platform. I'm going to assume that you want to use this in a web application context, and so are targeting PHP, but a much more sensible thing to do would be to implement the structures and algorithms in a suitable language, and expose it to your web application via an extension.

Suitable language generally means C or C++ for a PHP extension, but can mean others, if you are inventive enough.

You would still not be able to break the rules, but you wouldn't need to.

Obviously this relies on your ability to do those things.

Joe Watkins
  • 17,032
  • 5
  • 41
  • 62
  • 1
    "I wrote pthreads": *pthreads* is *POSIX threads* that you didn't write, did you? You wrote the pthreads **module** for PHP. – Déjà vu Nov 11 '15 at 07:08
0

I'm not entirely sure I fully get the need.

To share data between subsequent requests (of the same client), you can use sessions or some caching.

To share data between subsequent requests (not of the same client), you can use some caching (Redis?)

To share data between multiple threads (and pass/persist data between coroutines), you need to use a dedicated multithreading library (like Swoole) to get around the fact that PHP itself is single-threaded. Or to enable efficient multiplexing, use a PHP platform like RoadRunner instead of php-fpm.

Andrei Dascalu
  • 1,059
  • 2
  • 14
  • 26
-3

You can do multi threading in php using PThreads.

https://github.com/krakjoe/pthreads

It uses posix threads and offer synchronization, thread pools and read/write/executable support for threaded objects.

It runs on PHP7. Here is a program with two counters that runs asynchronously.

<?php
   $thread1 = new class extends Thread {
       public function run() {
           for ($i = 0; $i < 10000; $i++) {
              echo "Hello thread1 ($i)\n";
           }
       }
   };

   $thread2 = new class extends Thread {
       public function run() {
           for ($i = 0; $i < 10000; $i++) {
              echo "Hello thread2 ($i)\n";
           }
       }
   };

   $thread1->start() && $thread1->join();
   $thread2->start() && $thread2->join();
?>
Charlie
  • 22,886
  • 11
  • 59
  • 90
  • Multi-threading won't help as the OP also needs to persist the object in-memory between requests. PHP's design does not (by default) feature any persistence besides session-state which is serialized to disk (slow). There is no built-in provision for storing object-graphs in memory between requests. – Dai Oct 07 '15 at 18:35
  • @Dai - It's wrong for you to assume that the threading model should always be working across sessions. The model can also be such that multiple threads could be created inside a single session and handle three processes simultaneously. – Charlie Oct 08 '15 at 04:51
-3

Use files (php_get_content, file_put_content or change to a NodeJS server, PHP isn't made to share data between users...

roulioz
  • 19
  • 1
  • "PHP isn't made to share data between users" is plain wrong. Just add any kind of persistance layer, and PHP is pretty capable of sharing data between users – Nico Haase Jan 28 '21 at 15:25
  • This answer is out of topic, please read the question carefully – Matt Jan 28 '21 at 15:25
  • PHP isn't made to share data between "users"..., are you sure of that ? is "users" the correct word here ? – phoenixstudio Jan 28 '21 at 15:33