1

I have a data frame from which I only want the rows of users who have a certain item. I've already implemented that. What I want now is the list of items grouped by user. What I get instead is every single element of the data frame in its own list. How do I get a list of the form [[user1.item1, ..., user1.itemN], ..., [userN.item1, ..., userN.itemN]]?

import pandas as pd

d = {'userid': [0, 0, 0, 1, 2, 2, 3, 3, 4, 4, 4],
     'itemid': [715, 845, 98, 12324, 85, 715, 2112, 85, 2112, 852, 102]}
df = pd.DataFrame(data=d)
print(df)

users = df.loc[df.itemid == 715, "userid"]
df_new = df.loc[df.userid.isin(users)]

list_new = df_new[['itemid']].values.tolist()
# What I get
[[715],[845],[98],[85],[715]]
# What I want
[[715,845,98],[85,715]]
Ella

3 Answers

3

You may use a groupby operation

list_new = df_new.groupby("userid")['itemid'].apply(list).tolist()
print(list_new)  # [[715, 845, 98], [85, 715]]

The intermediate operation is

list_new = df_new.groupby("userid")['itemid'].apply(list)
print(list_new)  

userid
0    [715, 845, 98]
2         [85, 715]
Name: itemid, dtype: object
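
For reference, the question's filtering step and the groupby can be combined into one runnable script. This is a minimal sketch that reuses the question's data and variable names; only the comments are added here.

```python
import pandas as pd

d = {'userid': [0, 0, 0, 1, 2, 2, 3, 3, 4, 4, 4],
     'itemid': [715, 845, 98, 12324, 85, 715, 2112, 85, 2112, 852, 102]}
df = pd.DataFrame(data=d)

# Users that have item 715 at least once
users = df.loc[df.itemid == 715, "userid"]

# All rows that belong to those users
df_new = df.loc[df.userid.isin(users)]

# Collect the itemids of each user into one list, then turn the
# resulting Series of lists into a plain list of lists
list_new = df_new.groupby("userid")["itemid"].apply(list).tolist()
print(list_new)  # [[715, 845, 98], [85, 715]]
```

Note that groupby sorts the groups by userid by default, so the sublists come out in ascending userid order.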
azro
1

If you want to do everything in one line, you can use a list comprehension:

[x for x in [*df.groupby('userid')['itemid'].apply(list)] if 715 in x]

[[715, 845, 98], [85, 715]]

The code:

[*df.groupby('userid')['itemid'].apply(list)]

is equivalent to

df.groupby("userid")['itemid'].apply(list).tolist()

and the remaining part loops over that list of per-user lists and keeps each sublist x that contains 715.
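
The same idea can be wrapped in a small function so that the item is not hard-coded. This is only a sketch: `item_lists_containing` is a hypothetical helper name, not something from pandas, and it assumes the column names `userid` and `itemid` from the question.

```python
import pandas as pd

d = {'userid': [0, 0, 0, 1, 2, 2, 3, 3, 4, 4, 4],
     'itemid': [715, 845, 98, 12324, 85, 715, 2112, 85, 2112, 852, 102]}
df = pd.DataFrame(data=d)


def item_lists_containing(frame, item):
    """Hypothetical helper: per-user item lists that contain `item`."""
    # Series indexed by userid, each value is that user's list of itemids
    per_user = frame.groupby("userid")["itemid"].apply(list)
    # Iterating over a Series yields its values, i.e. the lists
    return [items for items in per_user if item in items]


result = item_lists_containing(df, 715)
print(result)  # [[715, 845, 98], [85, 715]]
```

Unlike the approach in the question, this groups all users first and filters afterwards, which is simpler to read but does more work on a large frame.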

David Erickson
0

1. We need to group our data by user id. Grouping is important in many applications, for example in preprocessing for machine learning. Suppose our data is collected from sensors at three stations (Station-1, Station-2 and Station-3) located in different parts of a state, and each station measures pressure and temperature. In many practical scenarios some values will be missing. If we use the entire data set to fill in the missing values, we may not get good results, because conditions differ between stations. If we fill each missing value using only that station's own data, we can get better results, since conditions at a particular station are similar.

  2. ans = df.groupby('userid')['itemid'].apply(list)
    
    userid
    0      [715, 845, 98]
    1             [12324]
    2           [85, 715]
    3          [2112, 85]
    4    [2112, 852, 102]
    Name: itemid, dtype: object
    

Each row gives all the itemids of one user.
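
The station example from point 1 can be sketched with the same groupby machinery. The data below is made up purely for illustration, and the column names `station`, `temperature`, and `temperature_filled` are assumptions; filling with the per-station mean is one possible choice, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up readings with one missing temperature per station
readings = pd.DataFrame({
    "station": ["Station-1", "Station-1", "Station-1",
                "Station-2", "Station-2", "Station-2"],
    "temperature": [20.0, np.nan, 22.0, 30.0, 32.0, np.nan],
})

# Mean temperature of each row's own station, aligned to the original rows
# (missing values are skipped when the mean is computed)
station_mean = readings.groupby("station")["temperature"].transform("mean")

# Fill each gap with the mean of that station only, not the global mean
readings["temperature_filled"] = readings["temperature"].fillna(station_mean)
print(readings)
```

With a global mean both gaps would be filled with 26.0, which fits neither station; the per-station fill gives 21.0 for Station-1 and 31.0 for Station-2.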