
I would like to create a sparse matrix in a vectorized way from a dataframe in which each row contains a vector of labels and a vector of values. The full set of possible labels is known in advance.

Another limitation is that I cannot create the dense dataframe first and then convert it to a sparse one, because it is too big to be held in memory.


Example:

List of all possible labels:

all_labels = ['a','b','c','d','e',\
          'f','g','h','i','j',\
          'k','l','m','n','o',\
          'p','q','r','s','t',\
          'u','v','w','z']

Dataframe with values for specific labels in each row:

import pandas as pd

data = {'labels': [['b','a'],['q'],['n','j','v']],
        'scores': [[0.1,0.2],[0.7],[0.3,0.5,0.1]]}
df = pd.DataFrame(data)

df

Expected dense output:

matrix


This is how I did it in a non-vectorized way, which is taking too much time:

from scipy import sparse
from scipy.sparse import coo_matrix

def labels_to_sparse(input_):
    all_, labels_, scores_ = input_
    # One row with a slot for every known label
    rows = [0]*len(all_)
    cols = range(len(all_))
    vals = [0]*len(all_)
    # Write each score at the position of its label in all_
    for i in range(len(labels_)):
        vals[all_.index(labels_[i])] = scores_[i]

    return coo_matrix((vals, (rows, cols)))

df['sparse_row'] = df.apply(
        lambda x: labels_to_sparse((all_labels, x['labels'], x['scores'])), axis=1
)

df

Even though this works, it is extremely slow with larger data, due to having to use df.apply. Is there a way to vectorize this function, to avoid using apply?

At the end, I want to use this dataframe to create a matrix:

my_result = sparse.vstack(df['sparse_row'].values)
my_result.todense() #not really needed - just for visualization

EDIT

To sum up the accepted solution (provided by @Divakar):

import numpy as np

# searchsorted needs a sorted lookup array
all_labels = np.sort(all_labels)
n = len(df)
lens = list(map(len,df['labels']))
l_ar = np.concatenate(df['labels'].to_list())
d = np.concatenate(df['scores'].to_list())
R = np.repeat(np.arange(n),lens)
C = np.searchsorted(all_labels,l_ar)

my_result = coo_matrix( (d, (R, C)), shape = (n,len(all_labels)))
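
Sorting makes the lookup valid, but as far as I can tell `np.searchsorted` does not complain about labels that are absent from `all_labels`: it just returns the insertion position, which can point at the wrong column or one past the last column. A minimal sketch of a sanity check, using small made-up arrays (the names `in_range`, `found` and `missing` are only for illustration):

```python
import numpy as np

# Made-up data: 'z' and 'bb' are deliberately not in all_labels.
all_labels = np.sort(np.array(['a', 'b', 'c']))
l_ar = np.array(['b', 'z', 'a', 'bb'])

C = np.searchsorted(all_labels, l_ar)

# An index equal to len(all_labels) is past the last column.
in_range = C < len(all_labels)

# For in-range indices, the label stored at that position must match exactly.
found = in_range.copy()
found[in_range] = all_labels[C[in_range]] == l_ar[in_range]

# Labels that could not be matched to any column.
missing = np.unique(l_ar[~found])
```

If `missing` is non-empty, it is probably safer to stop before building the matrix than to write scores into the wrong columns.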
matt525252

2 Answers


Here's one based on np.searchsorted -

n = len(df)
lens = list(map(len,df['labels']))
l_ar = np.concatenate(df['labels'])
d = np.concatenate(df['scores'])
out = np.zeros((n,len(all_labels)),dtype=d.dtype)
R = np.repeat(np.arange(n),lens)
C = np.searchsorted(all_labels,l_ar)
out[R, C] = d

Note: If all_labels is not sorted, we need to use the sorter argument of searchsorted.
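
For illustration, here is a rough sketch of how the `sorter` argument could be used when the original column order has to be kept. The small unsorted arrays are made up for this example, and `sidx` and `pos` are just illustrative names:

```python
import numpy as np

# Made-up, deliberately unsorted label list (illustration only).
all_labels = np.array(['q', 'a', 'n', 'b', 'v', 'j'])
l_ar = np.array(['b', 'a', 'q', 'n', 'j', 'v'])

# Permutation that would sort all_labels.
sidx = np.argsort(all_labels)

# Positions of each label within the *sorted* view of all_labels.
pos = np.searchsorted(all_labels, l_ar, sorter=sidx)

# Map the sorted positions back to indices in the original, unsorted array.
C = sidx[pos]
```

This assumes every entry of `l_ar` is present in `all_labels`; a label that sorts after all known labels would make `pos` equal to `len(all_labels)` and the final indexing step would fail.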

To get a sparse-matrix output, such as a coo_matrix -

from scipy.sparse import csr_matrix,coo_matrix

out_sparse = coo_matrix( (d, (R, C)), shape = (n,len(all_labels)))
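
Put together with the sample data from the question, the sparse variant could look roughly like the following. This is a self-contained sketch, not a tuned implementation: it skips the dense `out` array entirely, uses `.to_list()` before `np.concatenate`, and the final `tocsr()` conversion is an optional extra that assumes row slicing is wanted later.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Label list from the question (24 labels, already sorted).
all_labels = np.array(list('abcdefghijklmnopqrstuvwz'))

df = pd.DataFrame({
    'labels': [['b', 'a'], ['q'], ['n', 'j', 'v']],
    'scores': [[0.1, 0.2], [0.7], [0.3, 0.5, 0.1]],
})

n = len(df)
lens = list(map(len, df['labels']))             # number of labels per row
l_ar = np.concatenate(df['labels'].to_list())   # all labels, flattened
d = np.concatenate(df['scores'].to_list())      # all scores, flattened
R = np.repeat(np.arange(n), lens)               # row index of each score
C = np.searchsorted(all_labels, l_ar)           # column index of each score

# Only the non-zero entries are stored; no dense array is allocated.
out_sparse = coo_matrix((d, (R, C)), shape=(n, len(all_labels)))

# Optional: CSR supports element access and row slicing.
out_csr = out_sparse.tocsr()
```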
Divakar
  • Is there a way for `out` to be a sparse matrix? If I understand this correctly, `out` contains the result but it is a numpy array. Also, I had to add `.to_list()` before calling `np.concatenate`. There was no problem with the example from this question, but with the real dataset (where labels are words/phrases) it did not run without it (KeyError: 0). – matt525252 Dec 09 '19 at 10:21
  • `out_sparse` command is failing on: `ValueError: column index exceeds matrix dimensions`. My real dimensions: `len(all_labels)` - 9933; `n` - 407447; `len(lens)` - 407447; `len(l_ar)` - 3018669; `d.shape` - (3018669,); `R.shape` - (3018669,); `C.shape` - (3018669,) – matt525252 Dec 10 '19 at 09:56
  • @matt525252 Is `all_labels` sorted? – Divakar Dec 10 '19 at 09:57
  • @matt525252 Then as mentioned in the post, use `sorter` arg. Take inspiration from this post - https://stackoverflow.com/a/33678576/. – Divakar Dec 10 '19 at 09:59
  • When I first sort it (`all_labels = np.sort(all_labels)`), then your solution works. And it is really fast. Thanks for helping out! :) – matt525252 Dec 10 '19 at 10:09

Here are a couple of alternative methods you could try.

Method 1 - Restructure your DataFrame with a list comprehension and reindex

from string import ascii_lowercase

all_labels = list(ascii_lowercase)

my_result = (pd.DataFrame([dict(zip(l, v)) for _, (l, v) in df.iterrows()])
             .reindex(columns=all_labels).fillna(0).values)

Method 2 - for loop with updating values using loc

my_result = pd.DataFrame(np.zeros((len(df), len(all_labels))), columns=all_labels)

for i, (lab, val) in df.iterrows():
    my_result.loc[i, lab] = val

my_result = my_result.values

Both should yield the same output.

[out]

[[0.2 0.1 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.7 0.
  0.  0.  0.  0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0.  0.  0.  0.  0.  0.5 0.  0.  0.  0.3 0.  0.  0.  0.
  0.  0.  0.  0.1 0.  0.  0.  0. ]]
Chris Adams