0

I have this csv file below (products rated by users) which into pandas dataframe:

--------------------------------
User_id | Product_id | Rating  |
--------------------------------
1       | 00         | 3       |  
1       | 02         | 5       |
2       | 01         | 1       |
2       | 00         | 2       |
2       | 02         | 2       |

I want to change the dataframe so that it has the same number of rows as the source table above, but only two columns:

  • Column 1: needs to be a list of L length (L = total number of existing kinds of products), and where the n-th value (n = product_id) in the list is the rating given by the user in this row to the product. All all other values in the list need to be zeros
  • column 2 should be a list of the same L length, where the n-ths values equal to ratings for n-ths products (n = product_id) for all product_ids rated by this user (in the entire table); all other (unrated) values that are not rated by the user need to be zeros

The desired result would be (consistent with the example above):

--------------------------------
User_id | col1       | col2    |
--------------------------------
1       | [3,0,0]    | [3,0,5] |  
1       | [0,0,5]    | [3,0,5] |
2       | [0,1,0]    | [2,1,2] |
2       | [2,0,0]    | [2,1,2] |
2       | [0,0,2]    | [2,1,2] |

I will greatly appreciate any help with this. Please do ask questions if i can make the question & explanation more clear.

Mkaerobus
  • 1
  • 4

2 Answers2

0

I didn't get your question exactly but you can convert a 1d list to one hot 2-d vector using:

def one_hot(x, num_classes=3):
    return np.eye(num_classes)[x]

one_hot([0, 1, 2, 1])
>array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

In that case your num_classes should be the total number of products and the list should contain numbers between 0 and num_products-1

you can also convert [3, 0, 5] to [3, 0, 0] and [0,0,5] using:

w, v = np.linalg.eigh(np.diag([3, 0, 5]))
w*v
>array([[0., 3., 0.],
       [0., 0., 0.],
       [0., 0., 5.]])
bpfrd
  • 945
  • 3
  • 11
  • Thanks for your effort!! I don't think this addresses my question directly, but let me try and see if this can help. – Mkaerobus Jul 13 '22 at 12:34
  • What should I pass as X in your one_hot() function? – Mkaerobus Jul 13 '22 at 12:35
  • list of product_id's for each user – bpfrd Jul 13 '22 at 12:39
  • I see, the problem is that I don't need 1/2s and 0s returned in the result. I need rating values for the respective products (rated by this user), so the function need to take the rating values in addition to the product_ids (with which the ratings are associated) – Mkaerobus Jul 13 '22 at 12:46
0

I managed to solve this, however it feels like a lot of expensive code & operations for something relatively simple. If you have any ideas how to simplify this, I'd appreciate it a lot.

df = pd.read_csv('interactionsv21test.csv')

number_of_products = df['product_id'].nunique()

#assign indexes to products https://stackoverflow.com/questions/38088652/pandas-convert-categories-to-numbers
df.product_id = pd.Categorical(df.product_id)
df['product_indx'] = df.product_id.cat.codes
print('source table')
print(df.sort_values(['user_id', 'product_id', 'product_indx'], ascending=True).head(n=3))

df1 = (df.groupby([df['user_id']])
       .apply(lambda x: {int(i):int(k) for i,k in zip(x['product_indx'], x['rating'])})
       .reset_index(name='rating'))

#add blank values for non existing dictionary values https://stackoverflow.com/questions/38987/how-do-i-merge-two-dictionaries-in-a-single-expression
df1['rating_y'] = (df1['rating'].apply(lambda x: {int(k): 0 for k in range(number_of_products )} | x))

df['rating_x'] = df.apply(lambda row: {row['product_indx']:row['rating']}, axis=1)
df['rating_x'] = (df['rating_x'].apply(lambda x: {int(k): 0 for k in range(number_of_products)} | x ))

df = df[['user_id', 'rating_x']].merge(df1[['user_id','rating_y']],how='inner',left_on=['user_id'],right_on=['user_id'])

pd.set_option('display.max_columns', 7)
pd.set_option('display.width', 1000)

print('final result')
print(df.head(n=3))

Output:

source table
    user_id prod_id  rating  is_reviewed  prod_indx
40      198      63       5            1          1
0       198    2590       4            1         41
5       198    6960       4            1         51

final result
    user_id  rating_x                                           rating_y
...
40      198  {0: 0, 1: 5, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: ...  {0: 0, 1: 5, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: ...
41      198  {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: ...  {0: 0, 1: 5, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: ...
42      198  {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: ...  {0: 0, 1: 5, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: ...
Mkaerobus
  • 1
  • 4
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 15 '22 at 08:43