Context: I'm generating synthetic data in Python that I will then store as a shelve object.
On top of this, I build my scoring model using individual scoring PER category, frequent itemset mining, and collaborative filtering. For the latter two I need to be able to scan through the same categories of OTHER users, besides this one. This is why I chose a dict data structure, so that access is fast. Please point me to a better data structure for this use case if you see one.
My key point is this: after the prototype is done, this will run on real users, so np.random.choice will no longer consume the ~98% of time that it currently does. Aside from that, how else can I make this faster?
I have also mentioned the ranges of the number of items per list, to give you the context that the number of users >> the number of epoch times/footprints per user.
The structure of the data is as follows:
{
'U1': {
u'Vegetarian/Vegan': [1401572571,
7.850284461542393e-06],
u'Dumplings': [1402051698,
1.408092980963553e-05],
u'Gluten-free': [1400490069,
2.096967508708229e-06],
u'Lawyer': [1406549628,
0.0033942020684690575],
u'Falafel': [1409365870,
0.10524975778067258]
},
'U0': {
u'GasStation/Garage': [1394649440,
1.1279761215136923e-09],
u'MusicFestival': [1398401459,
1.0948922055743449e-07],
u'Chinese': [1408116633,
0.015294426605167652]
}}
The floating-point number you see after each epoch time is that user's score in THAT category, which I write back after my scoring calculations (the scoring code is not shown in this post).
More about the data: I have a primary key, the user (U0, U1, etc.), and a secondary key, the category (here 'Vegetarian/Vegan', etc.). Each of these secondary keys maps to a list of 1 or more items. Because of this I need to draw 2 random numbers (without replacement, between low and high indexes). These items in turn are epoch times. Conceptually, it says that a user U1 interacts with Vegetarian/Vegan at multiple epoch times, which I store in a list as the value for that category's key.
Say it was u'Vegetarian/Vegan': [1401572571]; then for each category I calculate a score and write it back to the same shelve object, after the synthetic data generation. Here's a stripped-down version of the code.
Question: I noticed that on a dataset of 5,000 users, just creating the shelve object takes more than 6 hours. What am I doing wrong? I need to be able to scale this up to about 50,000 users or more. I also did some preliminary line and memory profiling, and I am attaching the profiling results for a set of 5 users.
import json, math, codecs, pickle
import numpy as np
from collections import defaultdict
import shelve
from contextlib import closing

global low, high, no_categories, low_epoch_time, high_epoch_time, epoch_time_range, no_users
basepath = "/home/ekta/LatLongPrototype/FINALDUMP/"
low, high = 6, 15
no_categories = xrange(low, high + 1)
low_epoch_time, high_epoch_time = 1393200055, 1409400055
epoch_time_range = xrange(low_epoch_time, high_epoch_time + 1)
no_users = 5000
global users
users = []
global shelf_filehandle
shelf_filehandle = basepath + "shelf_contents"

def Synthetic_data_shelve(path, list_cat, list_epoch_time):
    for j in xrange(len(list_cat)):
        if not list_cat[j] in path.keys():
            path[list_cat[j]] = [list_epoch_time[j]]
        else:
            path[list_cat[j]] = path[list_cat[j]] + [list_epoch_time[j]]
    return path

def shelving():
    dict_user = shelve.open(shelf_filehandle)
    for i in xrange(no_users):
        each_footprint = int(np.random.choice(no_categories, 1, replace=False))
        list_cat = np.random.choice(sub_categories, each_footprint, replace=True)
        list_epoch_time = np.random.choice(epoch_time_range, each_footprint, replace=False)
        path = dict_user.get("U" + str(i), dict(defaultdict(dict)))
        path = Synthetic_data_shelve(path, list_cat, list_epoch_time)
        dict_user["U" + str(i)] = path
    dict_user.close()

# To test this quickly, consider categories such as:
sub_categories = ["C" + str(i) for i in xrange(50)]
shelving()
What I tried so far:
Profiling the program:
Here are the line_profiler results. I see that list_epoch_time=np.random.choice(epoch_time_range,each_footprint,replace=False) takes up 99.8% of the time!
I can superficially try to alias this as choice=np.random.choice, but that did not give a substantially lower % of time.
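One alternative I am considering (a sketch, in Python 3 syntax; the names mirror my variables above): np.random.choice has to materialize the ~16-million-element epoch_time_range into an array on every call, whereas np.random.randint draws directly from the interval bounds and never builds that array.

```python
import numpy as np

low_epoch_time, high_epoch_time = 1393200055, 1409400055

# np.random.choice(epoch_time_range, ...) converts the ~16.2M-value range
# into an ndarray on EVERY call; randint samples the interval directly.
each_footprint = 15
list_epoch_time = np.random.randint(low_epoch_time, high_epoch_time + 1,
                                    each_footprint)

# Caveat: randint samples WITH replacement. With ~16.2M possible values and
# at most 15 draws, a duplicate is vanishingly unlikely, and np.unique can
# drop duplicates if strict no-replacement sampling is required.
list_epoch_time = np.unique(list_epoch_time)
```

This trades the strict no-replacement guarantee for speed; whether that trade is acceptable depends on how much a rare duplicate epoch time would matter downstream.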
As mentioned before, the results below are for no_users=5.
ekta@superwomen:~$ kernprof.py -l -v LatLong_shelving.py
Wrote profile results to LatLong_shelving.py.lprof
Timer unit: 1e-06 s
File: LatLong_shelving.py
Function: Synthetic_data_shelve at line 22
Total time: 0.000213 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
22 @profile
23 def Synthetic_data_shelve(path, list_cat,list_epoch_time):
24 46 49 1.1 23.0 for j in xrange(len(list_cat)):
25 41 88 2.1 41.3 if not list_cat[j] in path.keys():
26 19 28 1.5 13.1 path[list_cat[j]] = [list_epoch_time[j]]
27 else :
28 22 44 2.0 20.7 path[list_cat[j]] = path[list_cat[j]]+[list_epoch_time[j]]
29 5 4 0.8 1.9 return path
File: LatLong_shelving.py
Function: shelving at line 31
Total time: 32.008 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
31 @profile
32 def shelving():
33 1 4020 4020.0 0.0 dict_user = shelve.open(shelf_filehandle)
34 6 13 2.2 0.0 for i in xrange(no_users):
35 5 541 108.2 0.0 each_footprint=int(np.random.choice(no_categories, 1,replace=False))
36 5 226 45.2 0.0 list_cat=np.random.choice(sub_categories,each_footprint,replace=True)
37 5 31942152 6388430.4 99.8 list_epoch_time=np.random.choice(epoch_time_range,each_footprint,replace=False)
38 5 1074 214.8 0.0 path =dict_user.get("U"+str(i), dict(defaultdict(dict)))
39 5 360 72.0 0.0 path=Synthetic_data_shelve(path, list_cat,list_epoch_time)
40 5 3302 660.4 0.0 dict_user["U"+str(i)] = path
41 1 56352 56352.0 0.2 dict_user.close()
And here's what I get from memory profiling.
How can we reduce the calls to Synthetic_data_shelve? If the entire logic for checking whether list_cat[j] is in path.keys() is moved into shelving, will it be faster? I obviously can't do reduce(Synthetic_data_shelve, path), since path is a dict and it is not permissible to reduce a dict. Also, both the i and j loops in Synthetic_data_shelve and shelving would be good candidates for reduce, since they are independent attributes PER user. How can I exploit this fact, and more?
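For the membership-check part of this question, here is a minimal sketch (Python 3 syntax; build_user is my hypothetical name) of folding Synthetic_data_shelve's logic into a single pass with collections.defaultdict, which removes the explicit test against path.keys() (a linear scan over a freshly built list on each iteration in Python 2):

```python
from collections import defaultdict

def build_user(list_cat, list_epoch_time):
    # One pass: defaultdict(list) creates the list on first access,
    # so no membership test against path.keys() is needed.
    path = defaultdict(list)
    for cat, t in zip(list_cat, list_epoch_time):
        path[cat].append(t)
    return dict(path)  # a plain dict pickles cleanly into the shelf

print(build_user(["C1", "C2", "C1"], [100, 200, 300]))
# → {'C1': [100, 300], 'C2': [200]}
```

This also replaces the repeated list concatenation (path[k] = path[k] + [t], which copies the list each time) with an in-place append.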
ekta@superwomen:~$ python -m memory_profiler LatLong_shelving.py
Line # Mem usage Increment Line Contents
================================================
23 15.2 MiB 0.0 MiB @profile
24 def Synthetic_data_shelve(path, list_cat,list_epoch_time):
25 15.2 MiB 0.0 MiB for j in xrange(len(list_cat)):
26 15.2 MiB 0.0 MiB if not list_cat[j] in path.keys():
27 15.2 MiB 0.0 MiB path[list_cat[j]] = [list_epoch_time[j]]
28 else :
29 15.2 MiB 0.0 MiB path[list_cat[j]] = path[list_cat[j]]+[list_epoch_time[j]]
30 15.2 MiB 0.0 MiB return path
Filename: LatLong_shelving.py
Line # Mem usage Increment Line Contents
================================================
32 14.5 MiB 0.0 MiB @profile
33 def shelving():
34 14.6 MiB 0.1 MiB dict_user = shelve.open(shelf_filehandle)
35 15.2 MiB 0.7 MiB for i in xrange(no_users):
36 15.2 MiB 0.0 MiB each_footprint=int(np.random.choice(no_categories, 1,replace=False))
37 15.2 MiB 0.0 MiB list_cat=np.random.choice(sub_categories,each_footprint,replace=True)
38 15.2 MiB 0.0 MiB list_epoch_time=choice(epoch_time_range,each_footprint,replace=False)
39 15.2 MiB 0.0 MiB path =dict_user.get("U"+str(i), dict(defaultdict(dict)))
40 15.2 MiB 0.0 MiB path=Synthetic_data_shelve(path, list_cat,list_epoch_time)
41 15.2 MiB 0.0 MiB dict_user["U"+str(i)] = path
42 15.2 MiB 0.0 MiB dict_user.close()
Related: python populate a shelve object/dictionary with multiple keys