I have two dataframes, let's call them Train and LogItem. There is a column called user_id in both of them.
For each row in Train, I take the user_id and the date fields (impression_time and AdThreshold_date) and pass them to a function that calculates some values from the LogItem dataframe. I use those values to populate two columns in Train (LogEntries_7days and Sessioncounts_7days) at the location of that particular row.
def ServerLogData(user_id, threshold, threshold7, dataframe):
    # The passed-in `dataframe` is overwritten; rows always come from the global LogItem.
    dataframe = LogItem[LogItem['user_id'] == user_id]
    # Keep this user's log rows strictly between threshold7 and threshold.
    UserData = dataframe.loc[(dataframe['server_time'] < threshold) &
                             (dataframe['server_time'] > threshold7)]
    entries = len(UserData)
    Unique_Session_Count = UserData.session_id.nunique()
    return entries, Unique_Session_Count
for id in Train.index:
    print(id)
    user_id = (Train.loc[[id], ['user_id']].values[0])[0]
    threshold = (Train.loc[[id], ['impression_time']].values[0])[0]
    threshold7 = (Train.loc[[id], ['AdThreshold_date']].values[0])[0]
    dataframe = []
    Train.loc[[id], 'LogEntries_7days'], Train.loc[[id], 'Sessioncounts_7days'] = \
        ServerLogData(user_id, threshold, threshold7, dataframe)
This approach is incredibly slow. As in databases, can we use an apply method here, or something else that would be fast enough?
Please suggest a better approach.
Edit: Based on suggestions from the super-helpful colleagues here, I am adding some data images for both dataframes, plus some explanation. The Train dataframe holds user actions with some date values, and there are multiple rows per user_id. For each row, I pass the user_id and dates to the other dataframe and calculate some values. Please note that the second dataframe also has multiple rows per user_id, for different dates, so grouping them does not seem to be an option here. I pass the user_id and dates, and the second dataframe is searched for rows that match the user_id and also fall within the dates I passed.
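To make the lookup described above concrete, here is a rough sketch of how the same per-row calculation might be expressed without a Python loop: merge on user_id, keep only the log rows inside each row's date window, then aggregate per Train row. The function name add_log_counts, the helper column train_idx, and the sample data are made up for illustration. It assumes Train has a unique index and that the merged frame fits in memory, which may not hold if users have many rows in both dataframes.

```python
import pandas as pd

def add_log_counts(train, log_item):
    """Hypothetical helper: per-row log counts via merge, filter, groupby."""
    # Keep each Train row's index label so results can be mapped back.
    # Assumption: train.index is unique.
    left = (train[["user_id", "impression_time", "AdThreshold_date"]]
            .rename_axis("train_idx")
            .reset_index())
    # Pair every Train row with all log rows of the same user.
    merged = left.merge(
        log_item[["user_id", "server_time", "session_id"]],
        on="user_id", how="inner")
    # Same strict date window as in ServerLogData.
    in_window = merged[(merged["server_time"] < merged["impression_time"]) &
                       (merged["server_time"] > merged["AdThreshold_date"])]
    # One result row per Train row that has at least one matching log row.
    stats = in_window.groupby("train_idx").agg(
        LogEntries_7days=("session_id", "size"),
        Sessioncounts_7days=("session_id", "nunique"))
    out = train.copy()
    # Train rows without any matching log rows get 0.
    for col in ["LogEntries_7days", "Sessioncounts_7days"]:
        out[col] = stats[col].reindex(out.index, fill_value=0)
    return out

# Made-up sample data, only to show the shape of the inputs.
train = pd.DataFrame({
    "user_id": [1, 1, 2],
    "impression_time": pd.to_datetime(["2019-01-10", "2019-01-20", "2019-01-10"]),
    "AdThreshold_date": pd.to_datetime(["2019-01-03", "2019-01-13", "2019-01-03"]),
})
log_item = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3],
    "server_time": pd.to_datetime(["2019-01-05", "2019-01-06", "2019-01-15",
                                   "2019-01-01", "2019-01-05"]),
    "session_id": ["a", "a", "b", "c", "d"],
})
result = add_log_counts(train, log_item)
```

The grouping here is by Train row rather than by user_id, which is why multiple rows per user in both dataframes should not be a problem. If the merged frame turns out to be too large, processing Train in chunks of users would be one way to bound memory.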