
I have a matrix of counts (55K x 8.5K). Most of the entries are zero, but a few can hold any count. Let's say something like this:

   a  b  c
0  4  3  3
1  1  2  1
2  2  1  0
3  2  0  1
4  2  0  4

I want to binarize the cell values.

I did the following:

df_preference = df_recommender.applymap(lambda x: np.where(x > 0, 1, 0))

The code works, but it takes a long time to run.

Why is that?

Is there a faster way?

Thanks

Edit:

When I pickle the DataFrame:

df_preference.to_pickle('df_preference.pickle')

I get this:

---------------------------------------------------------------------------
SystemError                               Traceback (most recent call last)
<ipython-input-16-3fa90d19520a> in <module>()
      1 # Pickling the data to the disk
      2 
----> 3 df_preference.to_pickle('df_preference.pickle')

\\dwdfhome01\Anaconda\lib\site-packages\pandas\core\generic.pyc in to_pickle(self, path)
   1170         """
   1171         from pandas.io.pickle import to_pickle
-> 1172         return to_pickle(self, path)
   1173 
   1174     def to_clipboard(self, excel=None, sep=None, **kwargs):

\\dwdfhome01\Anaconda\lib\site-packages\pandas\io\pickle.pyc in to_pickle(obj, path)
     13     """
     14     with open(path, 'wb') as f:
---> 15         pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
     16 
     17 

SystemError: error return without exception set
    Please don't edit your question to include a new question. Post a new question instead. – ayhan May 31 '16 at 19:22

2 Answers


UPDATE:

Read this topic and this issue regarding your error.

Try to save your DF as HDF5 - it's much more convenient.

You may also want to read this comparison...
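
For illustration, a minimal sketch of the HDF5 round trip (the file name and key below are made-up names, and to_hdf requires the PyTables package to be installed):

import numpy as np
import pandas as pd

# small stand-in for the real 55K x 8.5K preference frame
df_preference = (pd.DataFrame(np.random.randint(0, 5, size=(5, 3)),
                              columns=list('abc')) > 0).astype(np.int8)

# write to HDF5; 'df_preference.h5' and the key are illustrative
df_preference.to_hdf('df_preference.h5', key='preference', mode='w')

# read it back
restored = pd.read_hdf('df_preference.h5', key='preference')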

OLD answer:

try this:

In [110]: (df>0).astype(np.int8)
Out[110]:
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  0
3  1  0  1
4  1  0  1

.applymap() is one of the slowest methods, because it visits every cell individually (it essentially runs nested Python loops under the hood).

df > 0 is a vectorized operation, so it runs much faster.

.apply() is faster than .applymap() because it works column by column, but it is still much slower than df > 0.

UPDATE 2: a time comparison on a smaller DF (1000 x 1000), since applymap() would take ages on a (55K x 9K) DF:

In [5]: df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 1000)))

In [6]: %timeit df.applymap(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 3.75 s per loop

In [7]: %timeit df.apply(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 256 ms per loop

In [8]: %timeit (df>0).astype(np.int8)
100 loops, best of 3: 2.95 ms per loop

You could use a SciPy sparse matrix. That way the calculations touch only the data that is actually present, instead of operating on all the zeros.
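
For instance, a minimal sketch with scipy.sparse (the small frame below just stands in for the real data):

import numpy as np
import pandas as pd
from scipy import sparse

# small stand-in for the real counts matrix
df = pd.DataFrame({'a': [4, 1, 2, 2, 2],
                   'b': [3, 2, 1, 0, 0],
                   'c': [3, 1, 0, 1, 4]})

# CSR format stores only the nonzero entries
m = sparse.csr_matrix(df.values)

# binarize just the stored values; the zeros stay implicit
m.data = (m.data > 0).astype(np.int8)

On a mostly-zero 55K x 8.5K matrix this also saves a great deal of memory, since only the nonzero cells are ever stored.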
