
My machine learning script produces a lot of data (millions of BTrees contained in one root BTree) and stores it in ZODB's FileStorage, mainly because all of it wouldn't fit in RAM. The script also frequently modifies previously added data.

When I increased the complexity of the problem, so that more data needs to be stored, I noticed performance issues: the script now computes on average two to ten times slower (the only thing that changed is the amount of data to be stored and later retrieved for modification).

I tried setting `cache_size` to various values between 1000 and 50000. To be honest, the differences in speed were negligible.

I thought of switching to RelStorage, but unfortunately the docs only describe how to configure it with frameworks such as Zope or Plone. I'm using ZODB on its own.

I wonder if RelStorage would be faster in my case.

Here's how I currently set up the ZODB connection:

import ZODB
connection = ZODB.connection('zodb.fs', ...)
dbroot = connection.root()
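
(For reference, the cache size mentioned above is set when opening the database; a minimal sketch, assuming `ZODB.connection()` forwards keyword arguments to `ZODB.DB()`, with an illustrative value:)

    import ZODB

    # cache_size is a per-connection limit on the number of objects kept
    # in the in-memory object cache (an object count, not a byte size).
    connection = ZODB.connection('zodb.fs', cache_size=50000)
    dbroot = connection.root()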

It's clear to me that ZODB is currently the bottleneck of my script. I'm looking for advice on how to solve this problem.

I chose ZODB because I thought a NoSQL database would fit my case better, and I liked the idea of an interface similar to Python's dict.


Code and data structures:

  • root data structures:

    if not hasattr(dbroot, 'actions_values'):
        dbroot.actions_values = BTree()
    
    if not hasattr(dbroot, 'games_played'):
        dbroot.games_played = 0
    

    actions_values is conceptually built as follows:

    actions_values = { # BTree
        str(state): { # BTree
            # contains actions (the column to pick, to be exact, as I'm working on an agent playing Connect 4)
            # and their values (only actions previously taken by the agent are present here), e.g.:
            1: 0.4356,
            5: 0.3456,
        },
        # other states
    }
    

    state is a simple 2D array representing the game board. Possible values of its fields are 1, 2 or None:

    board = [ [ None ] * cols for _ in xrange(rows) ]
    

    (in my case rows = 6 and cols = 7)

  • main loop:

    import sys
    import transaction

    should_play = 10000000
    transactions_freq = 10000
    packing_freq = 50000
    
    player = ReinforcementPlayer(dbroot.actions_values, config)
    
    while dbroot.games_played < should_play:
        # max_epsilon at start, then drops linearly to min_epsilon:
        epsilon = max_epsilon - (max_epsilon - min_epsilon) * dbroot.games_played / (should_play - 1)
    
        dbroot.games_played += 1
        sys.stdout.write('\rPlaying game %d of %d' % (dbroot.games_played, should_play))
        sys.stdout.flush()
    
        board_state = player.play_game(epsilon)
    
        if dbroot.games_played % transactions_freq == 0:
            print('Committing...')
            transaction.commit()
        if dbroot.games_played % packing_freq == 0:
            print('Packing DB...')
            connection.db().pack()
    

    (packing also takes a lot of time, but it's not the main problem; I could pack the database after the program finishes)

  • Code operating on dbroot (inside ReinforcementPlayer):

    def get_actions_with_values(self, player_id, state):
        if player_id == 1:
            lookup_state = state
        else:
            lookup_state = state.switch_players()
        lookup_state_str = str(lookup_state)
        if lookup_state_str in self.actions_values:
            return self.actions_values[lookup_state_str]
        mirror_lookup_state_str = str(lookup_state.mirror())
        if mirror_lookup_state_str in self.actions_values:
            return self.mirror_actions(self.actions_values[mirror_lookup_state_str])
        return None
    
    def get_value_of_action(self, player_id, state, action, default=0):
        actions = self.get_actions_with_values(player_id, state)
        if actions is None:
            return default
        return actions.get(action, default)
    
    def set_value_of_action(self, player_id, state, action, value):
        if player_id == 1:
            lookup_state = state
        else:
            lookup_state = state.switch_players()
        lookup_state_str = str(lookup_state)
        if lookup_state_str in self.actions_values:
            self.actions_values[lookup_state_str][action] = value
            return
        mirror_lookup_state_str = str(lookup_state.mirror())
        if mirror_lookup_state_str in self.actions_values:
            self.actions_values[mirror_lookup_state_str][self.mirror_action(action)] = value
            return
        self.actions_values[lookup_state_str] = BTree()
        self.actions_values[lookup_state_str][action] = value
    

    (Functions with mirror in the name simply reverse the columns (actions). This is done because Connect 4 boards that are reflections of each other across the vertical axis are equivalent; a rough sketch is given below.)
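
    (The following standalone sketch only illustrates that description; the function bodies and the COLS constant are assumptions, not the actual methods:)

    COLS = 7  # board width used in the question (cols = 7)

    def mirror_action(action):
        # Reflect the chosen column across the board's vertical axis:
        # 0 <-> 6, 1 <-> 5, 2 <-> 4, 3 stays in place.
        return (COLS - 1) - action

    def mirror_actions(actions):
        # Return a plain dict with mirrored keys; the stored BTree is not modified.
        return dict((mirror_action(a), v) for a, v in actions.items())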


After 550000 games, `len(dbroot.actions_values)` is 6018450.


According to `iotop`, IO operations take 90% of the time.

Luke
  • Your *data size* is the bottleneck. There's only so much the ZODB can do here. I'm not sure that RelStorage will help here; you'll need to create an adapter (pick one matching your chosen database from the [`adapters` package](https://github.com/zodb/relstorage/tree/master/relstorage/adapters)) and pass it to a [`RelStorage()` instance](https://github.com/zodb/relstorage/blob/master/relstorage/storage.py#L78); use that instance instead of a `FileStorage` object (a rough sketch of that setup follows after these comments). – Martijn Pieters Jan 04 '16 at 18:56
  • @MartijnPieters What could help here? It's the first time I have to deal with such a big amount of data. Is there any other solution which would fit better? I'm a little overwhelmed by the prospect of waiting a week for my algorithm to finish. :) – Luke Jan 04 '16 at 19:03
  • Mikko has you covered there. – Martijn Pieters Jan 04 '16 at 19:18
  • Does/Can your data squeeze into a relational model? If so, put it in a relational database, ideally one that scales well with lots of writes. If the answer is yes, the remaining question is whether or not you are gaining anything at all from the ZODB, which is best for graphs of objects, particularly when permissions are granular. – SteveM Jan 04 '16 at 19:31
  • This is certainly more a data-structure/algorithm problem than a database problem (though indirectly, it is). You need to figure out how to balance what is persistent and in-memory without merely relying on LRU cache. Also, you have not described if performance is scaling according to your expected complexity -- e.g. is your algorithm O(n) when run on a set that can fit entirely in memory, or more complex? I doubt that the complexity of the problem has changed, only the size of the problem set. – sdupton Jan 04 '16 at 20:57
  • @sdupton The problem complexity isn't directly related to the size of already existing data (I just need to get the value for a given key). However, in the current implementation fetching data from the DB is probably logarithmic, since the data is stored in BTrees. – Luke Jan 04 '16 at 21:05
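
For reference, a rough sketch of the setup described in the first comment above (using RelStorage directly, without Zope/Plone). The PostgreSQL adapter, DSN and option values below are assumptions for illustration, not a tested configuration:

    import ZODB
    from relstorage.adapters.postgresql import PostgreSQLAdapter
    from relstorage.options import Options
    from relstorage.storage import RelStorage

    # Build an adapter for the chosen backend and hand it to RelStorage,
    # which then takes the place of the FileStorage object.
    options = Options(keep_history=False)  # history-free storage; optional
    adapter = PostgreSQLAdapter(dsn="dbname='zodb'", options=options)
    storage = RelStorage(adapter, options=options)

    db = ZODB.DB(storage)
    connection = db.open()
    dbroot = connection.root()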

2 Answers


Using any (other) database would probably not help, as they are subject to the same disk IO and memory limitations as ZODB. If you manage to offload computations to the database engine itself (PostgreSQL + SQL scripts) it might help, as the database engine would have more information to make intelligent choices about how to execute the code, but there is nothing magical here and the same things can most likely be done with ZODB quite easily.

Some ideas what can be done:

  • Have indexes of data instead of loading full objects (loading everything is the equivalent of an SQL "full table scan"). Keep intelligently preprocessed copies of data: indexes, sums, partials.

  • Make the objects themselves smaller (Python classes have the `__slots__` trick); see the sketch after this list.

  • Use transactions in an intelligent fashion. Don't try to process all data in a single big chunk.

  • Parallel processing - use all CPU cores instead of a single-threaded approach

  • Don't use BTrees - maybe there is something more efficient for your use case
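
For the `__slots__` point, a minimal sketch of the idea on a plain Python class (the class name and attributes are illustrative; combining `__slots__` with `persistent.Persistent` typically also means providing custom `__getstate__`/`__setstate__`, as in the Zope DateTime example linked in the comments below):

    class GameStats(object):
        # With __slots__ instances have no per-instance __dict__, so each
        # object carries only the declared attributes, reducing memory overhead.
        __slots__ = ('wins', 'losses', 'draws')

        def __init__(self, wins=0, losses=0, draws=0):
            self.wins = wins
            self.losses = losses
            self.draws = draws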

Having some code samples of your script, actual RAM and Data.fs sizes, etc. would help here to give further ideas.

Mikko Ohtamaa
  • What I would do is to look at whether you are CPU bound or IO bound, e.g. watch `htop` or `top` during the script execution. If you are CPU bound, I'd split the main loop so that it uses several threads or processes to churn through the data. If you are IO bound (which I doubt) you could shard the data into separate ZODBs on several physical disks (computers). I understood (all) games are not dependent on each other? – Mikko Ohtamaa Jan 05 '16 at 06:41
  • @miko-ohtaama You doubt it, but it's really true. I was CPU bound when the problem size was smaller. Back then `System Activity` in KDE showed 25% (which I think means maxing out one core). Now it shows circa 15% or less. – Luke Jan 05 '16 at 07:52
  • @Luke: Maxing out one core of four is my guess, which means single-threaded behavior is the limiting factor (htop shows CPU usage per core). Try to parallelize your main loop, either using a lib like this https://pythonhosted.org/joblib/parallel.html or exploring the Python stdlib threading and multiprocessing modules. – Mikko Ohtamaa Jan 05 '16 at 09:22
  • @mikko-ohtamma I'm afraid you got me wrong. In the current situation the `python2` process isn't anywhere near maxing out one core. To me that means IO operations cause the slowdown. – Luke Jan 05 '16 at 10:00
  • Can you confirm this by also checking IO busy % using your favorite utility? – Mikko Ohtamaa Jan 05 '16 at 13:34
  • @miko-ohtamaa According to `iotop`, IO operations take 90% of the time. – Luke Jan 05 '16 at 19:16
  • @Luke - mmmm. Then it's the disk access which is the limiting factor. I would try the following: 1) buy more RAM 2) buy faster disks 3) optimize the code so that data is more organized and "chunks" of data are processed at once, with less random access to the disk – Mikko Ohtamaa Jan 05 '16 at 20:35
  • Can you group / shard data somehow instead of having it in a random btree, so that a working set always fits into RAM? – Mikko Ohtamaa Jan 05 '16 at 20:41
  • The algorithm which would achieve that would itself be an optimisation task. – Luke Jan 05 '16 at 21:38
  • Some more ideas I came up with which can give some extended lifetime before algorithm changes: use micro-optimizations to make your objects smaller (`__setstate__`, `__getstate__` and `__slots__`). A good example is Zope DateTime: https://github.com/zopefoundation/DateTime/blob/master/src/DateTime/DateTime.py#L449 – Mikko Ohtamaa Jan 06 '16 at 06:03
  • Because disk IO is the limiting factor, it might make sense to compress objects so there are fewer bytes to write and read: https://pypi.python.org/pypi/zc.zlibstorage (a rough sketch of wrapping FileStorage with it follows after these comments) – Mikko Ohtamaa Jan 06 '16 at 06:04
  • @miko-ohtamaa After conversion to zlibstorage it runs faster, but I'm not sure if it's fast enough. I'll probably change the implementation to neural net approximation. If you're interested, IO usage dropped to around 60%. – Luke Jan 06 '16 at 20:23
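
(A rough sketch of the compression idea from the comments above, assuming `zc.zlibstorage` wrapping the existing FileStorage; note that an already written, uncompressed Data.fs has to be converted/copied before the savings apply:)

    import ZODB
    import ZODB.FileStorage
    import zc.zlibstorage

    # Wrap the FileStorage so records are transparently compressed on
    # write and decompressed on read, reducing disk IO.
    storage = zc.zlibstorage.ZlibStorage(ZODB.FileStorage.FileStorage('zodb.fs'))
    db = ZODB.DB(storage)
    connection = db.open()
    dbroot = connection.root()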

Just to be clear here, which BTree class are you actually using? An OOBTree?

A few points about those BTrees:

1) Each BTree is composed of a number of Buckets. Each Bucket will hold a certain number of items before being split. I can't remember how many items they hold currently, but I did once try tweaking the C code for them and recompiling to hold a larger number, as the value was chosen nearly two decades ago.

2) It is sometimes possible to construct very unbalanced BTrees, e.g. if you add values in sorted order (such as a timestamp that only ever increases) you will end up with a tree that is O(n) to search. There was a script written by the folks at Jarn a number of years ago that could rebalance the BTrees in Zope's Catalog, which might be adaptable for you.

3) Rather than using an OOBTree you can use an OOBucket instead. This will end up being just a single pickle in the ZODB, so it may end up too big for your use case, but if you are doing all the writes in a single transaction then it may be faster (at the expense of having to re-write the entire Bucket on an update).
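
(A minimal sketch of the difference, assuming the standard `BTrees.OOBTree` module; the keys and values are illustrative:)

    from BTrees.OOBTree import OOBTree, OOBucket

    # OOBTree: stored as many small bucket records in the ZODB; only the
    # buckets touched by a read or write are loaded or re-pickled.
    tree = OOBTree()
    tree['some-state-string'] = {1: 0.4356}

    # OOBucket: a single persistent record; the whole mapping is one
    # pickle and is re-written in full whenever it changes.
    bucket = OOBucket()
    bucket['some-state-string'] = {1: 0.4356}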

-Matt

Matt Hamilton