I'm trying to understand how PyTables manages data whose size is greater than the available memory. Here is a comment from the PyTables code (link to GitHub):
# Nodes referenced by a variable are kept in `_aliveNodes`.
# When they are no longer referenced, they move themselves
# to `_deadNodes`, where they are kept until they are referenced again
# or they are preempted from it by other unreferenced nodes.
Useful comments can also be found inside the `_getNode` method.
It seems like PyTables has a fairly smart I/O buffering system which, as I understand it, keeps data currently referenced by the user in RAM as "aliveNodes", keeps previously referenced but currently unreferenced data as "deadNodes" so it can be "revived" quickly when needed again, and reads data from disk only if the requested key is in neither category.
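Just to check my mental model, here is a toy sketch of how I picture the alive/dead split. This is not PyTables' actual code; the class `ToyNodeCache` and its methods `get_node`, `release`, `_load_from_disk`, and the `cache_slots` parameter are all made up for illustration:

```python
from collections import OrderedDict

class ToyNodeCache:
    """Toy model of my understanding: 'alive' nodes are currently
    referenced, 'dead' nodes are unreferenced but still cached in RAM,
    and anything else has to be re-read from disk."""

    def __init__(self, cache_slots=64):
        self.alive = {}                  # key -> node, referenced right now
        self.dead = OrderedDict()        # key -> node, unreferenced but cached
        self.cache_slots = cache_slots   # max number of dead nodes to keep

    def get_node(self, key):
        if key in self.alive:                    # cheapest case: already in use
            return self.alive[key]
        if key in self.dead:                     # "revive" from the dead cache
            node = self.dead.pop(key)
        else:
            node = self._load_from_disk(key)     # slowest case: real disk I/O
        self.alive[key] = node
        return node

    def release(self, key):
        """Called when a node is no longer referenced by the user."""
        node = self.alive.pop(key)
        self.dead[key] = node
        while len(self.dead) > self.cache_slots: # preempt the oldest dead nodes
            self.dead.popitem(last=False)

    def _load_from_disk(self, key):
        return f"node-{key}"                     # placeholder for a real HDF5 read
```

Is this roughly the right picture?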
I need some expertise on how exactly PyTables handles situations where the data is larger than the available memory. My specific questions:
- How does the deadNode/aliveNode system work (general picture)?
- What is the key difference between aliveNodes and deadNodes, given that (if I'm right) both represent data stored in RAM?
- Can the amount of RAM used for buffering be adjusted manually? Below the comment, there is code which reads a value from `params['NODE_CACHE_SLOTS']`. Can this be specified by the user somehow, for example if I want to leave some RAM for other applications that also need memory? (See the first code sketch after this list.)
- In what situations can PyTables crash or slow down significantly when working with a big amount of data? In my case the data can exceed memory by a factor of 100; what are the common pitfalls in such situations?
- What usage of PyTables, in terms of data size, data structure, and the manipulations performed on the data, is considered "right" for achieving the best performance?
- The docs suggest using `.flush()` after each basic `.append()` cycle. How long can this cycle actually be? I'm running a little benchmark comparing SQLite and PyTables on how they handle building a huge table of key-value pairs from big CSV files. When I call `.flush()` less frequently in the main loop, PyTables gains a huge speedup. So is it correct to `.append()` relatively big chunks of data and only then call `.flush()`? (The second sketch after this list shows roughly what my loop looks like.)
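For the cache-size question, this is what I would try, assuming I read the docs and `tables/parameters.py` correctly. The file name `data.h5` and the value 256 are arbitrary:

```python
import tables

# Option 1: override the default globally before opening any files
# (assuming tables.parameters.NODE_CACHE_SLOTS is the right knob).
tables.parameters.NODE_CACHE_SLOTS = 256

# Option 2: override it per file; open_file() appears to accept names
# from parameters.py as keyword arguments.
h5file = tables.open_file("data.h5", mode="a", NODE_CACHE_SLOTS=256)
h5file.close()
```

Is either of these the intended way to control how much the node cache can grow?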
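And for the `.append()`/`.flush()` question, here is roughly what my benchmark loop looks like. The CSV file name, the chunk size, and the column types are arbitrary, and I assume two-column `key,value` rows with no header:

```python
import csv
import tables

class KeyValue(tables.IsDescription):
    key = tables.StringCol(64, pos=0)    # assumed fixed-width key column
    value = tables.Float64Col(pos=1)     # assumed numeric value column

CHUNK_ROWS = 100_000                     # arbitrary chunk size for the benchmark

with tables.open_file("kv.h5", mode="w") as h5file:
    table = h5file.create_table("/", "pairs", KeyValue, "key-value pairs")
    buffer = []
    with open("big_input.csv", newline="") as f:     # hypothetical CSV file
        for key, value in csv.reader(f):
            buffer.append((key.encode(), float(value)))
            if len(buffer) >= CHUNK_ROWS:
                table.append(buffer)     # append a big chunk in one call
                table.flush()            # flush once per chunk, not per row
                buffer = []
    if buffer:                           # write out the final partial chunk
        table.append(buffer)
        table.flush()
```

Is this chunked pattern the "right" way to use `.append()` and `.flush()`, or am I just hiding some other cost?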