
I intend to make a program structure like the one below.

Program Structure

PS1 is a Python program that runs persistently. PC1, PC2, PC3 are client Python programs. PS1 holds a hash table in a variable; whenever PC1, PC2, ... asks for the hash table, PS1 passes it to them.

The intention is to keep the table in memory, since it is a huge variable (it takes 10G of memory) and it is expensive to calculate every time. It is not feasible to store it on the hard disk (using pickle or json) and read it every time it is needed. The read just takes too long.

So I was wondering if there is a way to keep a Python variable persistently in memory, so it can be used very quickly whenever it is needed.
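For illustration, here is a rough sketch of the structure I have in mind, using `multiprocessing.managers.BaseManager` (the address, authkey, file names, and keys are placeholders): PS1 keeps the table in its own memory and each PC# gets a proxy to it, so every lookup is a small request to PS1 instead of a copy of the whole 10G object. I am not sure this is the right approach, which is why I am asking:

```python
# ps1.py -- hypothetical persistent server process holding the table in memory
from multiprocessing.managers import BaseManager

table = {}                 # in reality: the huge, expensive-to-build hash table
table['some_key'] = 42

class TableManager(BaseManager):
    pass

# Expose the in-memory table; connecting clients receive a proxy, not a copy.
TableManager.register('get_table', callable=lambda: table)

if __name__ == '__main__':
    manager = TableManager(address=('localhost', 50000), authkey=b'secret')
    server = manager.get_server()
    server.serve_forever()
```

```python
# pc1.py -- hypothetical client process asking PS1 for the table
from multiprocessing.managers import BaseManager

class TableManager(BaseManager):
    pass

TableManager.register('get_table')

if __name__ == '__main__':
    manager = TableManager(address=('localhost', 50000), authkey=b'secret')
    manager.connect()
    table = manager.get_table()      # proxy to the table living in PS1's memory
    print(table.get('some_key'))     # each lookup is a round trip to PS1
```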

Peter XU
  • Store it in a database? That's exactly what databases are used for. An alternative is to allocate shared memory for the variable and let other Python processes access it. – DYZ Jan 25 '17 at 05:36
  • Have you considered using a database? When you say "10 GB hashtable", my first thought is "MongoDB" (or similar key-value store setup). Passing around 10 GB hash tables seems wholly unnecessary. – ShadowRanger Jan 25 '17 at 05:39
  • http://stackoverflow.com/questions/6832554/python-multiprocessing-how-do-i-share-a-dict-among-multiple-processes – John Zwinck Jan 25 '17 at 05:45
  • http://stackoverflow.com/questions/9856196/sharing-a-variable-between-processes – Chandan Rai Jan 25 '17 at 05:49
  • @DYZ "hashtable" may not be the right way to refer to the variable; it is a two-dimensional array. The PCs need every value of it, so a database does not seem like a good fit, since extracting every value from a database is not fast, to my understanding. – Peter XU Jan 25 '17 at 06:13
  • Then, shared memory (possibly through memory-mapped files https://docs.python.org/3.5/library/mmap.html). – DYZ Jan 25 '17 at 06:15

2 Answers


You are trying to reinvent a square wheel, when nice round wheels already exist!

Let's go one level up and look at how you have described your needs:

  • one large data set that is expensive to build
  • different processes need to use the dataset
  • performance constraints do not allow simply reading the full set from permanent storage every time

IMHO, this is exactly what databases were created for. For common use cases, having many processes each holding their own copy of a 10G object is a waste of memory; the common pattern is that one single process holds the data and the others send requests for the parts they need. You have not described your problem in enough detail, so I cannot say whether the best solution will be:

  • a SQL database like PostgreSQL or MariaDB - since they can cache, if you have enough memory, everything will automatically be held in memory
  • a NOSQL database (MongoDB, etc.) if your only (or main) need is single-key access - very nice when dealing with a lot of data requiring fast but simple access
  • a dedicated server using a dedicated query language, if your needs are very specific and none of the above solutions meet them
  • a process setting up a huge piece of shared memory that is used by the client processes (see the sketch after this list) - that last solution will certainly be fastest, provided:
    • all clients make read-only accesses - it can be extended to r/w accesses, but that could lead to a synchronization nightmare
    • you are sure to have enough memory on your system to never use swap - if you do, you will lose all the cache optimizations that real databases implement
    • the size of the database, the number of client processes, and the external load of the whole system never increase to a level where you run into the swapping problem above
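Here is a minimal sketch of that last shared-memory option, assuming Python 3.8+ and that the two-dimensional table can be represented as a fixed-dtype NumPy array; the block name `ps1_table` and the shapes are illustrative only (the real array would be around 10 GB):

```python
import numpy as np
from multiprocessing import shared_memory

# --- in PS1 (the server): build the array once and publish it under a known name
data = np.arange(1000 * 500, dtype=np.float64).reshape(1000, 500)   # stand-in for the expensive table
shm = shared_memory.SharedMemory(create=True, size=data.nbytes, name='ps1_table')
shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
shared[:] = data                       # copy the table into shared memory once

# --- in PC1/PC2/... (the clients): attach to the same block, read-only by convention
shm_client = shared_memory.SharedMemory(name='ps1_table')
table = np.ndarray((1000, 500), dtype=np.float64, buffer=shm_client.buf)
value = table[123, 45]                 # direct in-memory access, no copy of the table
shm_client.close()                     # detach; PS1 calls shm.unlink() when shutting down
```

The clients never receive a copy of the table; they map the same physical memory and index into it directly, which is why the read-only assumption above matters.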

TL/DR: My advice is to experiment with the performance of a good quality database, and optionally a dedicated cache. Those solutions allow almost out-of-the-box load balancing across different machines. Only if that does not work, carefully analyze the memory requirements, be sure to document the limits on the number of client processes and on the database size for future maintenance, and use shared memory - read-only data being a hint that shared memory can be a nice solution.
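As a rough illustration of the database side of that advice, here is a sketch using the standard-library sqlite3 module; a client/server database such as PostgreSQL or MongoDB would fit the multi-process case better, but the access pattern is the same: clients query the individual cells they need instead of receiving the whole 10G object. The table and column names are made up for the example.

```python
import sqlite3

# Build the table once (in the real case this is the expensive 10G computation).
conn = sqlite3.connect('table.db')
conn.execute('CREATE TABLE IF NOT EXISTS cells (row INTEGER, col INTEGER, val REAL, '
             'PRIMARY KEY (row, col))')
conn.executemany('INSERT OR REPLACE INTO cells VALUES (?, ?, ?)',
                 ((r, c, float(r * c)) for r in range(100) for c in range(100)))
conn.commit()

# A client process opens its own connection and looks up single cells on demand.
client = sqlite3.connect('table.db')
val = client.execute('SELECT val FROM cells WHERE row = ? AND col = ?', (3, 7)).fetchone()[0]
print(val)   # 21.0
```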

Serge Ballesta

In short, to accomplish what you are asking about, you need to create a byte array as a RawArray (from the multiprocessing.sharedctypes module) that is large enough for your entire hashtable in the PS1 server, and then store the hashtable in that RawArray. PS1 needs to be the process that launches PC1, PC2, etc., which can then inherit access to the RawArray. You can create your own class of object that provides the hashtable interface through which the individual values in the table are accessed; an instance of it can be passed separately to each of the PC# processes, which read from the shared RawArray.
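For illustration, here is a rough sketch of that layout; the names (`SharedTable`, `pc_worker`) and sizes are placeholders, and the real buffer would need to be sized for the full ~10 GB table:

```python
from multiprocessing import Process
from multiprocessing.sharedctypes import RawArray

ROWS, COLS = 1000, 1000        # the real table would be much larger

class SharedTable:
    """Hashtable-like wrapper exposing (row, col) lookups on the flat shared buffer."""
    def __init__(self, raw, cols):
        self.raw = raw
        self.cols = cols
    def get(self, row, col):
        return self.raw[row * self.cols + col]

def pc_worker(raw, cols):
    # Each PC# process builds its own wrapper but reads the same shared memory.
    table = SharedTable(raw, cols)
    print(table.get(3, 7))

if __name__ == '__main__':
    raw = RawArray('d', ROWS * COLS)     # one shared buffer allocated by PS1
    raw[3 * COLS + 7] = 42.0             # ...filled once with the expensive values
    workers = [Process(target=pc_worker, args=(raw, COLS)) for _ in range(3)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```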

RichardB
  • _PS1 needs to be the process that launches PC1, PC2, etc._ Is there a way to make PC1, PC2, etc. independent from PS1? When requesting the "table", it would be just like visiting a website and "downloading" the table, but everything happens in memory? – Peter XU Jan 25 '17 at 06:09
  • What do you mean by "independent"? After PS1 launches PC1, PC2, etc., they will be completely separate processes. When I say PS1 launches the others, I just mean it will have a line that says something like `p = Process(...); p.start()`. This would be as opposed to you launching PS1 from the command line and then manually launching PC1 from the command line, or having some other process launch PS1 and then PC1. – RichardB Jan 27 '17 at 12:12