Here is a summary of the pandas DataFrame I am working with:

Input:

final.info() 
print(final)

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 410978 entries, 0 to 410977
Data columns (total 3 columns):
    Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   customer_id  410978 non-null  int64 
 1   item_id      410978 non-null  object
 2   item_qty     410978 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 9.4+ MB

        customer_id  item_id  item_qty
0             12346  15056BL         1
1             12346   15056N         1
2             12346   15056P         1
3             12346    20679         1
4             12346    20682         1
...             ...      ...       ...
410973        18287   85040B        12
410974        18287    85041        12
410975        18287   85199S        24
410976        18287   85232B        24
410977        18287       C2         1

[410978 rows x 3 columns]

I am trying to make a sparse matrix from this dataframe using the following code:

#Get all the unique users for the sparse matrix rows
customers = list(np.sort(final.customer_id.unique()))

#Get all the unique products for the sparse matrix columns
products = list(final.item_id.unique())

#Get all the purchases
quantity = list(final.item_qty)

#Get the row indices
cat_type = CategoricalDtype(categories=customers, ordered=True)
rows = final.customer_id.astype(cat_type)

#Get the column indices
cat_type1 = CategoricalDtype(categories=products, ordered=True)
cols = final.item_id.astype(cat_type1)

#Create the sparse matrix
sparse_purchase = sparse.csr_matrix((quantity,(rows,cols)), shape=(len(customers), len(products)))

Error:

    ValueError                                Traceback (most recent call last)
<ipython-input-47-058d94e47e42> in <module>()
     17 
     18 #Create the sparse matrix
---> 19 sparse_purchase = sparse.csr_matrix((quantity,(rows,cols)), shape=(len(customers), len(products)))

5 frames
/usr/local/lib/python3.7/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81 
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84 
     85 

ValueError: invalid literal for int() with base 10: '15056BL'

I think sparse.csr_matrix requires integer inputs for the row and column indices, but since item_id has object dtype I am unable to create the sparse matrix. Converting item_id to int is not possible either, because values such as '15056BL' contain letters. Is there some way to build a sparse matrix when the IDs have object dtype?

Any help would be appreciated.
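
The integer requirement described above may not require converting item_id itself. A minimal sketch, assuming the integer codes of the categorical columns (`.cat.codes`) are acceptable as row and column positions; the small `final` frame below is made-up stand-in data with the same column names, not the real dataset:

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy import sparse

# Hypothetical stand-in for `final` (same column names and dtypes as above)
final = pd.DataFrame({
    "customer_id": [12346, 12346, 18287, 18287],
    "item_id": ["15056BL", "20679", "15056BL", "C2"],
    "item_qty": [1, 1, 12, 1],
})

# Unique labels that define the matrix rows and columns
customers = np.sort(final["customer_id"].unique())
products = final["item_id"].unique()

# .cat.codes is each row's integer position within the category list,
# so the string IDs themselves are never passed to csr_matrix
row_type = CategoricalDtype(categories=customers, ordered=True)
col_type = CategoricalDtype(categories=products, ordered=True)
rows = final["customer_id"].astype(row_type).cat.codes
cols = final["item_id"].astype(col_type).cat.codes

# Build the matrix from (data, (row_index, col_index))
sparse_purchase = sparse.csr_matrix(
    (final["item_qty"].to_numpy(), (rows.to_numpy(), cols.to_numpy())),
    shape=(len(customers), len(products)),
)

print(sparse_purchase.toarray())
```

With this approach `products[j]` gives back the item_id for column `j`, and `customers[i]` the customer_id for row `i`. Note that csr_matrix sums the quantities of any repeated (customer, item) pair when it is built this way.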

Wojack