Compare two dataframes and conditionally capture random data in Python

Question

My question is partly about comparing two dataframes, but it differs from the existing questions here (Q1, Q2, Q3).

Let's create two dummy dataframes.

import pandas as pd

data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
         'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200, 400],
         'count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50, 55, 60, 70, 100, 120, 180, 200, 300, 400]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

data2 contains every checkinid value. I am trying to create a training file.

For example, user 1 visited 5 places with ids (10, 20, 30, 40, 50).

I want to randomly add places that user 1 did not visit and set their 'count' column to 0.

My expected dataframe looks like this:

user checkinid count
1       10        1
1       20        1
1       30        1
1       40        1
1       50        1
1       300       0 (add randomly)
1       180       0 (add randomly)
1       55        0 (add randomly)
2       35        1
2       45        1
2       55        1  
2       20        1
2       120       1
2       10        0 (add randomly)
2       400       0 (add randomly)
2       180       0 (add randomly)
...     ...

Readers may ask how many random rows should be added. For this example, adding 3 non-visited places per user is enough.

Please go through the [intro tour](https://stackoverflow.com/tour), the [help center](https://stackoverflow.com/help) and [how to ask a good question](https://stackoverflow.com/help/how-to-ask) to see how this site works and to help you improve your current and future questions, which can help you get better answers. "Show me how to solve this coding problem?" is off-topic for Stack Overflow. You have to make an honest attempt at the solution, and then ask a *specific* question about your implementation. Stack Overflow is not intended to replace existing tutorials and documentation. — Prune, Feb 06 '21 at 19:11
Yes, you found how to compare two data frames, but that's not the problem you face. You're merely finding the difference of two lists (one simple research problem), and then selecting 3 random values from that list of differences (another simple research problem). From your lack of posted code, I expect that you need to consult the `random` package documentation. Look especially at `choice` and `sample`. — Prune, Feb 06 '21 at 19:13
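
The two steps described in the comment above (take the difference between all places and the visited ones, then pick a few at random) can be sketched roughly as follows. This is only an illustration under assumptions: the helper name `add_negatives` and its `n` and `seed` parameters are hypothetical and do not come from the question, and capping `n` at the number of available places is a design choice made here.

```python
import random

import pandas as pd

# Same dummy data as in the question.
df1 = pd.DataFrame({
    'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200, 400],
    'count': [1] * 16,
})
df2 = pd.DataFrame({'checkinid': [10, 20, 30, 35, 40, 45, 50, 55, 60, 70,
                                  100, 120, 180, 200, 300, 400]})


def add_negatives(visits, places, n=3, seed=None):
    """Hypothetical helper: append up to n random unvisited places per user with count 0."""
    rng = random.Random(seed)
    all_ids = set(places['checkinid'])
    rows = []
    for user, group in visits.groupby('user'):
        # Difference of two collections: every place minus the visited ones.
        unvisited = sorted(all_ids - set(group['checkinid']))
        # random.sample draws without replacement; cap n at what is available.
        for checkinid in rng.sample(unvisited, min(n, len(unvisited))):
            rows.append({'user': user, 'checkinid': checkinid, 'count': 0})
    negatives = pd.DataFrame(rows, columns=['user', 'checkinid', 'count'])
    out = pd.concat([visits, negatives], ignore_index=True)
    # A stable sort keeps each user's visited rows ahead of the added ones.
    return out.sort_values('user', kind='stable').reset_index(drop=True)


result = add_negatives(df1, df2, n=3, seed=0)
print(result)
```

Passing a `seed` only makes the draw repeatable; leave it as `None` to get different places on each run.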

A.Shenoy · Answer 1 · 2021-02-07T05:44:07.287

This might not be the best solution, but it works. You have to go through each user and then pick the checkinids that are not assigned to them.

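import pandas as pd

# get all users
users = df1['user'].unique()

for user in users:
    # rows for the places this user has already visited
    checkins = df1.loc[df1['user'] == user]
    # outer merge on 'checkinid'; 'right_only' rows are places the user never visited
    df = (checkins.merge(df2, how='outer', indicator=True)
          .loc[lambda x: x['_merge'] == 'right_only']
          .sample(n=3))

    df['user'] = user
    df['count'] = 0
    df = df.drop(columns='_merge')
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df1 = pd.concat([df1, df], ignore_index=True)

# sort the dataframe by user (a stable sort keeps visited rows before the added ones)
df1 = df1.sort_values(by=['user'], kind='stable')

# re-arrange columns
df1 = df1[['user', 'checkinid', 'count']]

# print the dataframe
print(df1)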

you can add rows randomly using sample(n=3) instead of iloc[:3], in df =checkins.merge..... line — A.Shenoy, Feb 07 '21 at 05:39
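
To illustrate the difference mentioned in the comment above, here is a small sketch comparing `iloc[:3]` with `sample(n=3)`. The `unvisited` dataframe is an assumption: it is written out by hand as the places user 1 has not visited in the question's dummy data, rather than computed with the merge. The `random_state` value is also arbitrary and only there to make the draw repeatable.

```python
import pandas as pd

# Assumed input: places user 1 has not visited in the question's dummy data.
unvisited = pd.DataFrame(
    {'checkinid': [35, 45, 55, 60, 70, 100, 120, 180, 200, 300, 400]}
)

# iloc[:3] always takes the first three rows, so every run gives the same places.
first_three = unvisited.iloc[:3]

# sample(n=3) draws three rows at random without replacement;
# random_state is optional and only fixes the draw for reproducibility.
random_three = unvisited.sample(n=3, random_state=42)

print(first_three['checkinid'].tolist())   # [35, 45, 55]
print(random_three['checkinid'].tolist())
```

Note that `sample(n=3)` raises a `ValueError` when fewer than 3 rows are available (with the default `replace=False`), so a user who has visited almost every place would need a smaller `n`.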
