The main logic of my question is on comparing the two dataframes a little, but it will be different from the existing questions here. Q1, Q2, Q3
Let's create dummy two dataframes.
data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4,4],
'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200,400],
'count': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]}
data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50,55, 60, 70,100,120,180,200,300,400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
data2 consists of whole checkinid values. I am trying to create a training file.
For example, user 1 visited 5 places where ids are (10,20,30,40,50)
I want to add randomly the places that user 1 does not visit and set the 'count column' as 0.
My expectation dataframe like this
user checkinid count
1 10 1
1 20 1
1 30 1
1 40 1
1 50 1
1 300 0 (add randomly)
1 180 0 (add randomly)
1 55 0 (add randomly)
2 35 1
2 45 1
2 55 1
2 20 1
2 120 1
2 10 0 (add randomly)
2 400 0 (add randomly)
2 180 0 (add randomly)
... ...
Now those who read the question can ask how many random data they will add. For each user, just add 3 non-visited places is enough for this example.