
Context: I'm generating synthetic data in Python that I will then store as a shelve object.

On top of this, I build my scoring model using individual scoring per category, frequent itemset mining, and collaborative filtering. For the latter two I need to be able to scan similar categories of OTHER users, besides the current one. This is why I chose a dict data structure, so that access is fast. Please point me to a better data structure for this use case if you see one.
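For the cross-user scans in the latter two models, a user-keyed dict of dicts alone forces a pass over every user. One common complement (not a replacement) is a secondary inverted index from category to user ids, rebuilt once after loading the shelf. A minimal sketch, using hypothetical data in the same shape as shown below (the function name `build_category_index` is mine):

```python
from collections import defaultdict

# Hypothetical data in the post's shape: user -> {category: [epoch_time, score]}
users = {
    'U0': {'Chinese': [1408116633, 0.0153], 'MusicFestival': [1398401459, 1.09e-07]},
    'U1': {'Chinese': [1401572571, 7.85e-06], 'Falafel': [1409365870, 0.105]},
}

def build_category_index(users):
    """Invert user -> categories into category -> set of user ids."""
    by_category = defaultdict(set)
    for user_id, categories in users.items():
        for category in categories:
            by_category[category].add(user_id)
    return by_category

by_category = build_category_index(users)
# "All OTHER users who touched 'Chinese'" becomes a set lookup, not a full scan:
others = by_category['Chinese'] - {'U1'}  # -> {'U0'}
```

The index costs one extra pass over the data and some memory, but turns every "who else is in this category?" query into an O(1) lookup.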

My key point is: after this prototype is done, the same pipeline will run on real users, so np.random.choice will no longer consume the ~98% of the time that it consumes now. Aside from that, how else can I make this faster?

I have also mentioned the ranges of the number of items per list, to give you the context that # users >> # epoch times / footprints per user.

The structure of the data is as follows:

    {
        'U1': {
            u'Vegetarian/Vegan': [1401572571, 7.850284461542393e-06],
            u'Dumplings': [1402051698, 1.408092980963553e-05],
            u'Gluten-free': [1400490069, 2.096967508708229e-06],
            u'Lawyer': [1406549628, 0.0033942020684690575],
            u'Falafel': [1409365870, 0.10524975778067258]
        },
        'U0': {
            u'GasStation/Garage': [1394649440, 1.1279761215136923e-09],
            u'MusicFestival': [1398401459, 1.0948922055743449e-07],
            u'Chinese': [1408116633, 0.015294426605167652]
        }
    }

The floating-point number after each epoch time is that user's score in THAT category, which I write back after my scoring calculations (the scoring code is not shown in this post).

More about the data: I have a primary key, the user (U0, U1, etc.), and a secondary key, the category ('Vegetarian/Vegan', etc.). Each secondary key holds a list of one or more items; because of this, I need to draw two random numbers without replacement, between low and high indexes. The items in turn are epoch times. Conceptually, this says that user U1 interacts with Vegetarian/Vegan at multiple epoch times, which I store in a list as the value for that category's key.
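For the "two random numbers without replacement, between low and high indexes" draw, the standard library already guarantees distinct values via `random.sample`, so NumPy is not needed for this particular step. A sketch with hypothetical index bounds:

```python
import random

low, high = 0, 9  # hypothetical inclusive index bounds

# random.sample draws without replacement, so the two indices are distinct
i, j = random.sample(range(low, high + 1), 2)
assert i != j
```

For a handful of draws this avoids NumPy's per-call overhead of converting the candidate sequence into an array.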

Say it was u'Vegetarian/Vegan': [1401572571]; then for each category I calculate a score and write it back to the same shelve object, after the synthetic data generation. Here's a stripped-down version of the code.

Question: I noticed that on a dataset of 5000 users, the shelving takes more than 6 hours just to create the shelve object. What am I doing wrong? I need to be able to scale this up to about 50,000 users or more. I also did some preliminary line and memory profiling, and I am attaching the profiling results for a set of 5 users.

import numpy as np
import shelve

basepath = "/home/ekta/LatLongPrototype/FINALDUMP/"
low, high = 6, 15
no_categories = xrange(low, high + 1)
low_epoch_time, high_epoch_time = 1393200055, 1409400055
epoch_time_range = xrange(low_epoch_time, high_epoch_time + 1)
no_users = 5000
shelf_filehandle = basepath + "shelf_contents"


def Synthetic_data_shelve(path, list_cat, list_epoch_time):
    # Append each epoch time to its category's list, creating the list
    # on first sight of the category.
    for j in xrange(len(list_cat)):
        if list_cat[j] not in path:
            path[list_cat[j]] = [list_epoch_time[j]]
        else:
            path[list_cat[j]] = path[list_cat[j]] + [list_epoch_time[j]]
    return path


def shelving():
    dict_user = shelve.open(shelf_filehandle)
    for i in xrange(no_users):
        each_footprint = int(np.random.choice(no_categories, 1, replace=False))
        list_cat = np.random.choice(sub_categories, each_footprint, replace=True)
        list_epoch_time = np.random.choice(epoch_time_range, each_footprint, replace=False)
        path = dict_user.get("U" + str(i), {})
        path = Synthetic_data_shelve(path, list_cat, list_epoch_time)
        dict_user["U" + str(i)] = path
    dict_user.close()


# To test this quickly, consider categories such as:
sub_categories = ["C" + str(i) for i in xrange(50)]
shelving()

What I tried so far:

Profiling the program -:

Here are the results of line profiling - I see that `list_epoch_time=np.random.choice(epoch_time_range,each_footprint,replace=False)` takes up 99.8% of the time!

I superficially tried binding `choice=np.random.choice` beforehand, but that did not give a substantially lower % time.
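A plausible explanation for the hotspot: `np.random.choice` cannot sample from an `xrange` lazily, so it materializes the ~16-million-element `epoch_time_range` into an array on every call. If so, sampling the integers directly with `np.random.randint` sidesteps that entirely. One caveat: `randint` samples with replacement, but with at most ~15 draws from a 16-million-wide range, collisions are vanishingly rare, so re-drawing on a duplicate is cheap. A sketch (the helper name `draw_epoch_times` is mine):

```python
import numpy as np

low_epoch_time, high_epoch_time = 1393200055, 1409400055

def draw_epoch_times(n):
    """Draw n distinct epoch times without building the full range.

    np.random.choice(epoch_time_range, ...) first converts the xrange into
    a ~16-million-element array on every call; np.random.randint draws the
    integers directly. randint samples with replacement, so re-draw in the
    (rare) case of a duplicate.
    """
    while True:
        sample = np.random.randint(low_epoch_time, high_epoch_time + 1, size=n)
        if len(np.unique(sample)) == n:
            return sample
```

This turns the per-user cost from O(range size) to O(n), which should remove the 99.8% line from the profile.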

As mentioned before, the results below are for no_users=5.

ekta@superwomen:~$ kernprof.py -l -v  LatLong_shelving.py
Wrote profile results to LatLong_shelving.py.lprof
Timer unit: 1e-06 s

File: LatLong_shelving.py
Function: Synthetic_data_shelve at line 22
Total time: 0.000213 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           @profile
    23                                           def Synthetic_data_shelve(path, list_cat,list_epoch_time):
    24        46           49      1.1     23.0      for j in xrange(len(list_cat)):
    25        41           88      2.1     41.3          if not list_cat[j] in path.keys():
    26        19           28      1.5     13.1              path[list_cat[j]] = [list_epoch_time[j]]
    27                                                   else  :
    28        22           44      2.0     20.7              path[list_cat[j]] = path[list_cat[j]]+[list_epoch_time[j]]
    29         5            4      0.8      1.9      return path

File: LatLong_shelving.py
Function: shelving at line 31
Total time: 32.008 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    31                                           @profile
    32                                           def shelving():
    33         1         4020   4020.0      0.0      dict_user = shelve.open(shelf_filehandle)
    34         6           13      2.2      0.0      for i in xrange(no_users):
    35         5          541    108.2      0.0          each_footprint=int(np.random.choice(no_categories, 1,replace=False))
    36         5          226     45.2      0.0          list_cat=np.random.choice(sub_categories,each_footprint,replace=True)
    37         5     31942152 6388430.4     99.8          list_epoch_time=np.random.choice(epoch_time_range,each_footprint,replace=False)
    38         5         1074    214.8      0.0          path =dict_user.get("U"+str(i), dict(defaultdict(dict)))
    39         5          360     72.0      0.0          path=Synthetic_data_shelve(path, list_cat,list_epoch_time)
    40         5         3302    660.4      0.0          dict_user["U"+str(i)] = path
    41         1        56352  56352.0      0.2      dict_user.close()

And here's what I get from memory profiling.

How can we reduce the calls to Synthetic_data_shelve? If the whole logic for checking whether list_cat[j] is in path.keys() is moved into shelving, will it be faster? I obviously can't do reduce(Synthetic_data_shelve, path), since path is a dict and a dict can't be reduced. Also, both the i and j loops in Synthetic_data_shelve and shelving look like good candidates for batching, since they are independent attributes PER user. How can I exploit this fact, and what else?
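On the `path.keys()` check specifically: on Python 2, `key in path.keys()` builds a fresh list of keys on every iteration, making each membership test O(n), whereas `key in path` (or a `defaultdict`) is an O(1) hash lookup. A sketch of a one-pass replacement (the name `build_user` is mine, not from the post):

```python
from collections import defaultdict

def build_user(list_cat, list_epoch_time):
    """One-pass replacement for Synthetic_data_shelve.

    Appending via a defaultdict avoids both the per-iteration keys() scan
    and the list re-creation of `path[k] = path[k] + [v]`.
    """
    path = defaultdict(list)
    for cat, epoch in zip(list_cat, list_epoch_time):
        path[cat].append(epoch)
    return dict(path)  # plain dict pickles a bit leaner in the shelf

# e.g. build_user(['A', 'B', 'A'], [1, 2, 3]) -> {'A': [1, 3], 'B': [2]}
```

In shelving() this would be `path = build_user(list_cat, list_epoch_time)`, keeping the existing single `dict_user["U" + str(i)] = path` write per user, which is already the right granularity for shelve. Separately, opening the shelf with `shelve.open(shelf_filehandle, protocol=2)` selects the binary pickle protocol, which on Python 2 is typically faster than the default protocol 0.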

ekta@superwomen:~$ python -m memory_profiler LatLong_shelving.py

Line #    Mem usage    Increment   Line Contents
================================================
    23     15.2 MiB      0.0 MiB   @profile
    24                             def Synthetic_data_shelve(path, list_cat,list_epoch_time):
    25     15.2 MiB      0.0 MiB       for j in xrange(len(list_cat)):
    26     15.2 MiB      0.0 MiB           if not list_cat[j] in path.keys():
    27     15.2 MiB      0.0 MiB               path[list_cat[j]] = [list_epoch_time[j]]
    28                                     else  :
    29     15.2 MiB      0.0 MiB               path[list_cat[j]] = path[list_cat[j]]+[list_epoch_time[j]]
    30     15.2 MiB      0.0 MiB       return path


Filename: LatLong_shelving.py

Line #    Mem usage    Increment   Line Contents
================================================
    32     14.5 MiB      0.0 MiB   @profile
    33                             def shelving():
    34     14.6 MiB      0.1 MiB       dict_user = shelve.open(shelf_filehandle)
    35     15.2 MiB      0.7 MiB       for i in xrange(no_users):
    36     15.2 MiB      0.0 MiB           each_footprint=int(np.random.choice(no_categories, 1,replace=False))
    37     15.2 MiB      0.0 MiB           list_cat=np.random.choice(sub_categories,each_footprint,replace=True)
    38     15.2 MiB      0.0 MiB           list_epoch_time=choice(epoch_time_range,each_footprint,replace=False)
    39     15.2 MiB      0.0 MiB           path =dict_user.get("U"+str(i), dict(defaultdict(dict)))
    40     15.2 MiB      0.0 MiB           path=Synthetic_data_shelve(path, list_cat,list_epoch_time)
    41     15.2 MiB      0.0 MiB           dict_user["U"+str(i)] = path
    42     15.2 MiB      0.0 MiB       dict_user.close()

Related - python populate a shelve object/dictionary with multiple keys

  • I don't have time to read this whole thing right now, but the obvious first question is: Why do you think `shelve` is the right data structure here? If you're looking to store hierarchical data of more than 2 levels, the second-level dicts have to be unpickled and repickled for every change to any of their keys or values. Would a relational database, document store, or old-school hierarchical database fit better? – abarnert Sep 04 '14 at 08:53
  • @abarnert - If I need to build statistical models on top of this data(in Python) - would RDBMS/ other alternatives fit better ? Sorry, If that sounds very naive, but I haven't handled a use-case of this scale(which I needed to store as hash-maps, to maintain associations), which is why asking. So, for instance, I LOAD all the keys in Redis(say) - and read from there, or use MongoDB to scale this. My primary question is with 5000 users, it is unwarranted that this should take more than 6 hours just to create the shelf file. Thank you for taking the time to read. – ekta Sep 04 '14 at 09:22
