
Consider the pandas series `s`:

import pandas as pd

n = 1000
s = pd.Series([0] * n + [1] * n, dtype=int)

s.memory_usage()

8080

I can "sparsify" this by using to_sparse

s.to_sparse(fill_value=0).memory_usage()

4080

But I only have 2 types of integers. I'd think I could sparsify twice. Is there a way to do this?
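(Roughly the kind of second pass I'm imagining, as a sketch: the series is one run of 0s followed by one run of 1s, so a couple of values and run lengths should be enough to describe it.)

import numpy as np

n = 1000
# hand-rolled "double sparse" form: the two distinct values plus their run lengths
values = np.array([0, 1], dtype=np.uint8)
run_lengths = np.array([n, n])
print(values.nbytes + run_lengths.nbytes)   # a handful of bytes instead of 1000 stored 1s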

Mahesh Khond
piRSquared
  • I don't know how `pd` sparse stores its data. I'm guessing from these numbers that it collects the indices of all the 1s. A `scipy.sparse` matrix would store both the indices and the data (all the 1s). A scipy version has to have a 1% sparsity to see much advantage in memory and calculation speed. – hpaulj Nov 14 '16 at 18:56
  • what about `s.astype(np.uint8).to_sparse()`? – MaxU - stand with Ukraine Nov 14 '16 at 18:56
  • @MaxU that definitely drops the memory footprint. But it could drop it even further if it only tracked the positions of the zeros and ones. – piRSquared Nov 14 '16 at 18:59
  • Are you looking for a sparse array/matrix or some form of compression? One that replaces runs of 1s and 0s with a single count? – hpaulj Nov 14 '16 at 20:29
  • @hpaulj at the moment, I'm struggling to see the difference. As far as I understand, sparse is compression where we keep enough integrity to do operations quickly – piRSquared Nov 14 '16 at 20:31
  • I looked a bit at the pandas sparse docs. It looks like it will handle blocked sparsity better than `scipy`, but it still stores all the non-fill values. – hpaulj Nov 14 '16 at 21:23

2 Answers

3

Since you tagged this with scipy, I'll show you what a scipy.sparse matrix is like:

In [31]: n=100
In [32]: arr=np.array([[0]*n+[1]*n],int)
In [33]: M=sparse.csr_matrix(arr)
In [34]: M.data
Out[34]: 
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [35]: M.indices
Out[35]: 
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
       113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
       126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
       139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
       152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
       165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
       178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
       191, 192, 193, 194, 195, 196, 197, 198, 199], dtype=int32)
In [36]: M.indptr
Out[36]: array([  0, 100], dtype=int32)

It has replaced the 2*n elements of arr with 2 arrays of n elements each: the stored 1s in M.data and their column positions in M.indices. Even if I replace the int dtype with uint8, the M.indices array will still be int32.

The fact that your pandas version has half the memory usage suggests that it is just storing the indices, and somehow noting that the data part is all 1s. But that's just a guess.

How much greater sparsification do you expect?
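For comparison, here is a rough byte tally (my sketch, not part of the session above) of what a scipy.sparse version of the question's 2000-element series costs:

import numpy as np
from scipy import sparse

n = 1000
arr = np.array([[0] * n + [1] * n], dtype=int)
M = sparse.csr_matrix(arr)
# data holds the 1000 stored 1s, indices their column positions,
# indptr one entry per row boundary
print(M.data.nbytes, M.indices.nbytes, M.indptr.nbytes)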

====================

http://pandas.pydata.org/pandas-docs/stable/sparse.html

This example looks like pandas is implementing some sort of 'run' compression:

In [4]: sts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)

It has identified 2 blocks, of length 2 each. It still has to store the 4 nonfill values in some array.
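That docs example can be reproduced along these lines (a sketch using the old, pre-1.0 to_sparse API that the question uses):

import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan              # a run of fill values in the middle
sts = ts.to_sparse()           # default kind='block'
print(sts.sp_index)            # BlockIndex: locations [0, 8], lengths [2, 2]
print(sts.sp_values)           # the 4 non-fill values are still stored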

A csr sparse equivalent (for a row array):

In [1052]: arr=np.random.rand(10)
In [1053]: arr[2:-2]=0
In [1055]: M=sparse.csr_matrix(arr)
In [1056]: M
Out[1056]: 
<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>
In [1057]: M.data
Out[1057]: array([ 0.37875012,  0.73703368,  0.7935645 ,  0.22948213])
In [1058]: M.indices
Out[1058]: array([0, 1, 8, 9], dtype=int32)
In [1059]: M.indptr
Out[1059]: array([0, 4], dtype=int32)

The pandas version might be more compact if the non-fill values occur in blocks (runs). But I suspect

0         1.0
1         1.0
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8         1.0
9         1.0

would produce the same blocks. I don't see evidence of it trying to identify the identical 1.0 values and storing those as a value plus a count.
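A quick way to check that suspicion (my own test, same old to_sparse API as above):

import numpy as np
import pandas as pd

ts = pd.Series(1.0, index=range(10))
ts[2:-2] = np.nan
sts = ts.to_sparse()
print(sts.sp_index)    # expect the same BlockIndex: locations [0, 8], lengths [2, 2]
print(sts.sp_values)   # expect four stored 1.0s, i.e. no value-plus-count compression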

================

Based on @MaxU's answer, your sparse series stores 1000 1s, plus two single-element arrays that tell it where those values are stored.

In [56]: sp.memory_usage()
Out[56]: 1080

In [57]: sp.sp_index
Out[57]:
BlockIndex
Block locations: array([1000])
Block lengths: array([1000])

As long as the non-fill values occur in big runs, the block arrays will be small. But if you scattered those 1000 values throughout the series, you'd multiply the number of blocks substantially, something like:

 block locations: array([1,3,6,10,...])
 block lengths: array([1,1,1,2,1,...])
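A sketch of that with the question's data (my construction, not from the original answer):

import pandas as pd

n = 1000
s = pd.Series([0, 1] * n, dtype=int)    # 1000 ones, none adjacent
sp = s.to_sparse(fill_value=0)
print(sp.sp_index)                      # expect on the order of 1000 blocks, each of length 1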

I can imagine mapping between the csr layout and the pandas blocks, but haven't worked out the details. The csr layout is meant to work with 2d arrays, with a clear concept of rows and columns. Looks like a sparse dataframe just contains sparse series objects.

===================

https://stackoverflow.com/a/38157234/901925 shows how to map from sparse dataframe values to a scipy sparse matrix. For each column (data series) it uses `sp_values`, `fill_value`, and `sp_index`.

pandas/pandas/sparse/scipy_sparse.py has the code for interaction between scipy sparse and data series.

===================

`kind='integer'` produces a sparse structure more like `scipy.sparse`:

In [62]: n=5; s=pd.Series([0]*5+[1]*5, dtype=int)
In [63]: ss=s.to_sparse(fill_value=0, kind='integer')
In [64]: ss
Out[64]: 
0    0
1    0
2    0
3    0
4    0
5    1
6    1
7    1
8    1
9    1
dtype: int32
IntIndex
Indices: array([5, 6, 7, 8, 9])

contrast that with the default block:

dtype: int32
BlockIndex
Block locations: array([5])
Block lengths: array([5])

An equivalent column sparse matrix can be built with:

In [90]: data=ss.sp_values
In [91]: rows=ss.sp_index.indices
In [92]: cols=np.zeros_like(rows)
In [93]: sparse.csr_matrix((data,(rows,cols)))
Out[93]: 
<10x1 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in Compressed Sparse Row format>

There is a `to_coo` method, but it only works with the more complex `pd.MultiIndex` object (why?).
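For completeness, a sketch of that to_coo usage, modeled on the pandas sparse docs (the level names 'r' and 'c' are made up here):

import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0],
              index=pd.MultiIndex.from_tuples(
                  [(0, 0), (0, 1), (1, 0), (1, 1)], names=['r', 'c']))
ss = s.to_sparse()
# the MultiIndex levels play the role of the matrix rows and columns
A, rows, cols = ss.to_coo(row_levels=['r'], column_levels=['c'])
print(A.toarray())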

hpaulj
  • I tagged it scipy for that exact reason. Always learning ;-) – piRSquared Nov 14 '16 at 19:04
  • I'm less interested in the actual measure of sparsification but rather the ideas behind it. I'll have to go back to pandas and mess with the ordered nature of the zeros and ones to see what other insights I can glean. Thx – piRSquared Nov 14 '16 at 19:07
  • For what it's worth, the `scipy.sparse` module was modeled on the MATLAB sparse implementation, and other sparse linear algebra work. So formats like `csr` are optimized for things like matrix product and large linear equation solutions, e.g. big finite difference and finite element problems. – hpaulj Nov 14 '16 at 19:07
  • `getsizeof` is just measuring the size of the `M` object. It does not include the size of the `data` and `indices` arrays that I printed. `getsizeof` only sees the pointers to those arrays, not the arrays themselves. – hpaulj Nov 14 '16 at 20:27
3

The pandas documentation says:

Currently, float64, int64 and bool dtypes are supported.

so let's try converting your series to bool:

In [53]: s.memory_usage()
Out[53]: 8080

In [54]: s.to_sparse().memory_usage()
Out[54]: 4080

In [55]: sp = s.astype(bool).to_sparse()

In [56]: sp.memory_usage()
Out[56]: 1080

In [57]: sp.sp_index
Out[57]:
BlockIndex
Block locations: array([1000])
Block lengths: array([1000])
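Rough accounting for that 1080 (my arithmetic, not part of the answer): 1000 stored bool values at 1 byte each, plus roughly 80 bytes that memory_usage() counts for the index.

import pandas as pd

n = 1000
s = pd.Series([0] * n + [1] * n, dtype=int)
sp = s.astype(bool).to_sparse()
print(sp.sp_values.nbytes)    # 1000 stored bools * 1 byte each
print(sp.memory_usage())      # 1080 = those values plus ~80 bytes of index overhead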
MaxU - stand with Ukraine