
I have a CSV file (n types of products rated by users):

Simplified illustration of source table

--------------------------------
User_id | Product_id | Rating  |
--------------------------------
1       | 00         | 3       |  
1       | 02         | 5       |
2       | 01         | 1       |
2       | 00         | 2       |
2       | 02         | 2       |

I load it into a pandas DataFrame and I want to transform it, converting the per-product rating values from rows to columns in the following way:

  • as a result of the conversion, the number of rows will remain the same, but there will be 6 additional columns
  • 3 columns (p0rt, p1rt, p2rt) each correspond to a product type. They need to contain the rating given by the user in this row to that product. Only one of these three columns per row can have a rating; the other two must be zeros/nulls
  • 3 columns (uspr0rt, uspr1rt, uspr2rt) need to contain all product ratings provided by the user in this row; values in columns related to products unrated by this user must be zeros/nulls

Desired output

------------------------------------------------------
User_id |p0rt |p1rt |p2rt |uspr0rt |uspr1rt |uspr2rt |
------------------------------------------------------
1       |  3  |  0  |  0  |   3    |   0    |   5    |
1       |  0  |  0  |  5  |   3    |   0    |   5    |
2       |  0  |  1  |  0  |   2    |   1    |   2    |
2       |  2  |  0  |  0  |   2    |   1    |   2    |
2       |  0  |  0  |  2  |   2    |   1    |   2    |

I will greatly appreciate any help with this. The actual number of distinct Product_ids/product types is ~60,000 and the number of rows in the file is ~400 million, so performance is important.
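For what it's worth, the transform itself can be expressed compactly on the toy data above: the p*rt block is a one-hot encoding of Product_id scaled by the row's rating, and the uspr*rt block is that same encoding summed per user and broadcast back onto each of the user's rows. A minimal sketch (dense, so it only illustrates the shape of the result, not the memory problem, and it assumes each user rates a product at most once):

import pandas as pd

df = pd.DataFrame({'User_id':    [1, 1, 2, 2, 2],
                   'Product_id': [0, 2, 1, 0, 2],
                   'Rating':     [3, 5, 1, 2, 2]})

# One column per product; each row keeps only its own rating
dummies = pd.get_dummies(df['Product_id']).mul(df['Rating'], axis=0)
p = dummies.rename(columns=lambda c: f'p{c}rt')

# Broadcast every user's full rating vector back onto each of their rows
uspr = (dummies.groupby(df['User_id']).transform('sum')
               .rename(columns=lambda c: f'uspr{c}rt'))

out = pd.concat([df[['User_id']], p, uspr], axis=1)

Running this reproduces the desired output table above.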

Update 1

I tried using pivot_table, but the dataset is too large for it to work (I wonder if there is a way to do it in batches):

df = pd.read_csv('product_ratings.csv')
df = df.pivot_table(index=['User_id', 'Product_id'], columns='Product_id', values='Rating')

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 983.4 GiB for an array with shape (20004, 70000000) and data type float64
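The error makes the failure mode concrete: pivot_table tries to allocate the entire dense result up front, so chunking the input alone doesn't help unless the output is also kept small. One piece that does batch cleanly is the per-user part behind the uspr*rt columns: a single rating per (User_id, Product_id) pair can be accumulated chunk by chunk without ever holding the full table in memory. A sketch, assuming the original column names and that taking the first rating per pair is acceptable:

import pandas as pd

acc = None
reader = pd.read_csv('product_ratings.csv', chunksize=50_000_000,
                     dtype={'User_id': 'int32', 'Product_id': 'int32',
                            'Rating': 'int8'})
for chunk in reader:
    part = chunk.groupby(['User_id', 'Product_id'])['Rating'].first()
    # Merge with what we have so far, de-duplicating pairs that
    # straddle a chunk boundary
    acc = part if acc is None else (
        pd.concat([acc, part]).groupby(level=[0, 1]).first())

After the loop, acc maps (User_id, Product_id) -> Rating for the whole file; the p*rt columns could then be filled in a second pass over the CSV.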

Update 2

I tried "chunking" the data and applied pivot_table to a smaller chunk (240mln rows and "only" 1300 types of products) as a test, but this didn't work either:

My code:

df = pd.read_csv('minified.csv', nrows=99999990000, dtype={0:'int32',1:'int16',2:'int8'})
df_piv = pd.pivot_table(df, index=['product_id', 'user_id'], columns='product_id', values='rating', aggfunc='first', fill_value=0).fillna(0)

Outcome:

IndexError: index 1845657558 is out of bounds for axis 0 with size 1845656426

This is a known, unresolved pandas issue: IndexError: index 1491188345 is out of bounds for axis 0 with size 1491089723. I think I'll try Dask next; if that does not work, I guess I'll need to write the data reshaper myself in C++ or another lower-level language.
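Before dropping down to C++, a sparse representation may be worth a try: both column blocks have very few non-zeros per row, so they fit naturally into SciPy CSR matrices and the huge dense array is never allocated. A sketch, assuming Product_id values are contiguous integers starting at 0 and that each user rates a product at most once:

import numpy as np
import pandas as pd
from scipy import sparse

df = pd.read_csv('product_ratings.csv',
                 dtype={'User_id': 'int32', 'Product_id': 'int32',
                        'Rating': 'int8'})
n_rows = len(df)
n_products = int(df['Product_id'].max()) + 1
ratings = df['Rating'].to_numpy()
prod = df['Product_id'].to_numpy()

# p*rt block: exactly one non-zero per row, in that row's product column
p_block = sparse.csr_matrix(
    (ratings, (np.arange(n_rows), prod)), shape=(n_rows, n_products))

# uspr*rt block: one rating vector per user, repeated onto each of their rows
user_codes, _ = pd.factorize(df['User_id'])
per_user = sparse.csr_matrix(
    (ratings, (user_codes, prod)),
    shape=(user_codes.max() + 1, n_products))
uspr_block = per_user[user_codes]  # fancy indexing repeats each user's row

wide = sparse.hstack([p_block, uspr_block], format='csr')

Reading the whole CSV at once still needs a few GB of RAM at these dtypes; the same construction could instead be fed from the chunked pass above.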

  • It really seems to me like you're converting from a useful format - ratings by customer and product - into a useless format. Why are you expanding like this? What is the next stage? – Reinderien Jul 15 '22 at 23:46
  • @Reinderien, thanks for looking at this!!! I pass the resulting columns as neural net inputs to calculate embeddings (columns p0rt, ... are inputs, columns uspr0rt, ... are outputs). This seems like a very standard operation, and there are many questions like this (e.g. https://stackoverflow.com/questions/71513262/pandas-memory-error-for-large-dataframe-pivot or https://stackoverflow.com/questions/64089038/how-can-i-pivot-a-really-large-dataframe-using-dask or https://stackoverflow.com/questions/64116324/pandas-pivot-table-groupby-taking-too-long-on-very-large-dataframe), but none of them has been solved. – Mkaerobus Jul 16 '22 at 05:59
