I have a csv file (n types of products rated by users):
Simplified illustration of source table
--------------------------------
User_id | Product_id | Rating |
--------------------------------
1 | 00 | 3 |
1 | 02 | 5 |
2 | 01 | 1 |
2 | 00 | 2 |
2 | 02 | 2 |
I load it into a pandas dataframe and I want to transform it, converting per ratings values from rows to columns in the following way:
- as a result of the conversion the number of rows will remain the same, but there will be 6 additional columns
- 3 columns (p0rt, p0rt, p2rt) each correspond to a product type. They need contain a product rating given by the user in this row to a product. Just one of the columns per row can have a rating and the other two must be zeros/nulls
- 3 columns (uspr0rt, uspr0rt, uspr2rt) need contain all product ratings provided by the user in Just one of the columns per row can have a rating and the other two must be zeros;values in columns related to products unrated by this user must be zeros/nulls
Desired output
------------------------------------------------------
User_id |p0rt |p1rt |p2rt |uspr0rt |uspr1rt |uspr2rt |
------------------------------------------------------
1 | 3 | 0 | 0 | 3 | 0 | 5 |
1 | 0 | 0 | 5 | 3 | 0 | 5 |
2 | 0 | 1 | 0 | 2 | 1 | 2 |
2 | 2 | 0 | 0 | 2 | 1 | 2 |
2 | 0 | 0 | 2 | 2 | 1 | 2 |
I will greatly appreciate any help with this. The actual number of distinct product_ids/product types is ~60,000 and the number of rows in the file is ~400mln, so performance is important.
Update 1
I tried using pivot_table but the dataset is too large for it to work (I wonder if there is a way to do it in baches)
df = pd.read_csv('product_ratings.csv')
df = df.pivot_table(index=['User_id', 'Product_id'], columns='Product_id', values='Rating')
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 983.4 GiB for an array with shape (20004, 70000000) and data type float64
Update 2
I tried "chunking" the data and applied pivot_table to a smaller chunk (240mln rows and "only" 1300 types of products) as a test, but this didn't work either:
My code:
df = pd.read_csv('minified.csv', nrows=99999990000, dtype={0:'int32',1:'int16',2:'int8'})
df_piv = pd.pivot_table(df, index=['product_id', 'user_id'], columns='product_id', values='rating', aggfunc='first', fill_value=0).fillna(0)
Outcome:
IndexError: index 1845657558 is out of bounds for axis 0 with size 1845656426
This is a known Pandas issue which is unresolved IndexError: index 1491188345 is out of bounds for axis 0 with size 1491089723 I think i'll try Dask next, if this does not work, I guess I'll need to write the data reshaper myself in C++ or other lower level language