
I have a matrix of counts (55K x 8.5K). Most of the entries are zero, but a few can hold any count. Let's say something like this:

   a  b  c
0  4  3  3
1  1  2  1
2  2  1  0
3  2  0  1
4  2  0  4

I want to binarize the cell values.

I did the following:

df_preference = df_recommender.applymap(lambda x: np.where(x > 0, 1, 0))

The code works, but it takes a long time to run.

Why is that?

Is there a faster way?

Thanks

Edit:

When I pickle the DataFrame:

df_preference.to_pickle('df_preference.pickle')

I get this:

---------------------------------------------------------------------------
SystemError                               Traceback (most recent call last)
<ipython-input-16-3fa90d19520a> in <module>()
      1 # Pickling the data to the disk
      2 
----> 3 df_preference.to_pickle('df_preference.pickle')

\\dwdfhome01\Anaconda\lib\site-packages\pandas\core\generic.pyc in to_pickle(self, path)
   1170         """
   1171         from pandas.io.pickle import to_pickle
-> 1172         return to_pickle(self, path)
   1173 
   1174     def to_clipboard(self, excel=None, sep=None, **kwargs):

\\dwdfhome01\Anaconda\lib\site-packages\pandas\io\pickle.pyc in to_pickle(obj, path)
     13     """
     14     with open(path, 'wb') as f:
---> 15         pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
     16 
     17 

SystemError: error return without exception set
    Please don't edit your question to include a new question. Post a new question instead. – ayhan May 31 '16 at 19:22

2 Answers


UPDATE:

Read this topic and this issue regarding your error.

Try to save your DF as HDF5 - it's much more convenient.

You may also want to read this comparison...
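
For illustration, a minimal sketch of the HDF5 round trip (the file name and key below are made-up names, and to_hdf requires the PyTables package to be installed):

import numpy as np
import pandas as pd

# small stand-in for the real 55K x 8.5K preference frame
df_preference = (pd.DataFrame(np.random.randint(0, 5, size=(5, 3)),
                              columns=list('abc')) > 0).astype(np.int8)

# write to HDF5; 'df_preference.h5' and the key are illustrative
df_preference.to_hdf('df_preference.h5', key='preference', mode='w')

# read it back
restored = pd.read_hdf('df_preference.h5', key='preference')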

OLD answer:

try this:

In [110]: (df>0).astype(np.int8)
Out[110]:
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  0
3  1  0  1
4  1  0  1

.applymap() is one of the slowest methods, because it visits every cell individually (it essentially runs nested Python loops under the hood).

df > 0 is a vectorized operation, so it runs much faster.

.apply() is faster than .applymap() because it works column by column, but it is still much slower than df > 0.

UPDATE 2: a time comparison on a smaller DF (1000 x 1000), since applymap() would take ages on a (55K x 9K) DF:

In [5]: df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 1000)))

In [6]: %timeit df.applymap(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 3.75 s per loop

In [7]: %timeit df.apply(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 256 ms per loop

In [8]: %timeit (df>0).astype(np.int8)
100 loops, best of 3: 2.95 ms per loop

You could use a SciPy sparse matrix. That way the calculations touch only the data that is actually present, instead of operating on all the zeros.
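
For instance, a minimal sketch with scipy.sparse (the small frame below just stands in for the real data):

import numpy as np
import pandas as pd
from scipy import sparse

# small stand-in for the real counts matrix
df = pd.DataFrame({'a': [4, 1, 2, 2, 2],
                   'b': [3, 2, 1, 0, 0],
                   'c': [3, 1, 0, 1, 4]})

# CSR format stores only the nonzero entries
m = sparse.csr_matrix(df.values)

# binarize just the stored values; the zeros stay implicit
m.data = (m.data > 0).astype(np.int8)

On a mostly-zero 55K x 8.5K matrix this also saves a great deal of memory, since only the nonzero cells are ever stored.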
