I would like to create a sparse matrix in a vectorized way from a dataframe that contains a column of label lists and a column of matching score lists, given that I know all possible labels in advance.
A further constraint is that I cannot create the dense dataframe first and then convert it to a sparse one, because it is too big to be held in memory.
Example:
List of all possible labels:
all_labels = ['a','b','c','d','e',\
'f','g','h','i','j',\
'k','l','m','n','o',\
'p','q','r','s','t',\
'u','v','w','z']
Dataframe with values for specific labels in each row:
import pandas as pd
data = {'labels': [['b','a'],['q'],['n','j','v']],
        'scores': [[0.1,0.2],[0.7],[0.3,0.5,0.1]]}
df = pd.DataFrame(data)
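Expected dense output: a 3x24 matrix with one column per label in all_labels, holding each row's scores in the columns of its labels and 0 everywhere else (for example, row 0 has 0.2 in column 'a' and 0.1 in column 'b').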
This is how I did it in a non-vectorized way, which is taking too much time:
from scipy import sparse
from scipy.sparse import coo_matrix

def labels_to_sparse(input_):
    all_, lables_, scores_ = input_
    # Build one 1 x len(all_) row; every column is stored explicitly, zeros included
    rows = [0]*len(all_)
    cols = range(len(all_))
    vals = [0]*len(all_)
    for i in range(len(lables_)):
        # list.index does a linear scan over all_ for each label
        vals[all_.index(lables_[i])] = scores_[i]
    return coo_matrix((vals, (rows, cols)))

df['sparse_row'] = df.apply(
    lambda x: labels_to_sparse((all_labels, x['labels'], x['scores'])), axis=1
)
df
Even though this works, it is extremely slow with larger data because it has to go through df.apply. Is there a way to vectorize this function and avoid using apply?
At the end, I want to use this dataframe to create a matrix:
my_result = sparse.vstack(df['sparse_row'].values)
my_result.todense() #not really needed - just for visualization
EDIT
To sum up the accepted solution (provided by @Divakar):
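import numpy as np

all_labels = np.sort(all_labels)               # searchsorted needs sorted labels
n = len(df)
lens = list(map(len,df['labels']))             # number of labels in each row
l_ar = np.concatenate(df['labels'].to_list())  # all labels, flattened
d = np.concatenate(df['scores'].to_list())     # all scores, in the same order
R = np.repeat(np.arange(n),lens)               # row index of each score
C = np.searchsorted(all_labels,l_ar)           # column index of each label
my_result = coo_matrix( (d, (R, C)), shape = (n,len(all_labels)))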
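For anyone who wants to try this end to end, here is a self-contained sketch of the same approach run on the example data from the question. The variable names (flat_labels, flat_scores, rows, cols, result) and the unknown-label check are my own additions, not part of the accepted answer. The check is there because np.searchsorted returns an insertion position even for a label that is not in all_labels, so a typo in the data would otherwise end up silently in the wrong column.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# The 24 labels from the question, sorted so that searchsorted is valid
all_labels = np.array(sorted("abcdefghijklmnopqrstuvwz"))

df = pd.DataFrame({
    "labels": [["b", "a"], ["q"], ["n", "j", "v"]],
    "scores": [[0.1, 0.2], [0.7], [0.3, 0.5, 0.1]],
})

n = len(df)
lens = [len(x) for x in df["labels"]]                 # labels per row
flat_labels = np.concatenate(df["labels"].to_list())  # all labels, flattened
flat_scores = np.concatenate(df["scores"].to_list())  # all scores, same order

rows = np.repeat(np.arange(n), lens)                  # row index of each score
cols = np.searchsorted(all_labels, flat_labels)       # column index of each label

# Added safety check (assumption: unknown labels should raise an error)
safe_cols = np.clip(cols, 0, len(all_labels) - 1)
if not np.array_equal(all_labels[safe_cols], flat_labels):
    raise ValueError("found a label that is not in all_labels")

# Only the non-zero entries are stored; CSR is convenient for row slicing
result = coo_matrix((flat_scores, (rows, cols)),
                    shape=(n, len(all_labels))).tocsr()
```

Calling result.toarray() is only sensible on small data like this example; on the real data the matrix should stay sparse.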