
I have two dataframes, let's call them Train and LogItem. There is a column called user_id in both of them.

For each row in Train, I pick the user_id and a date field and pass them to a function that calculates some values from the LogItem dataframe, which I then use to populate two columns in Train (LogEntries_7days, Sessioncounts_7days) at that row's location.

def ServerLogData(user_id,threshold,threshold7,dataframe):
  dataframe = LogItem[LogItem['user_id']==user_id]
  UserData = dataframe.loc[(dataframe['user_id']==user_id) & 
  (dataframe['server_time']<threshold) & 
  (dataframe['server_time']>threshold7)]
  entries = len(UserData)
  Unique_Session_Count = UserData.session_id.nunique()
  return entries,Unique_Session_Count

for id in Train.index:
  print (id)    
  user_id = (Train.loc[[id],['user_id']].values[0])[0]
  threshold = (Train.loc[[id],['impression_time']].values[0])[0]
  threshold7 = (Train.loc[[id],['AdThreshold_date']].values[0])[0]
  dataframe=[]
  Train.loc[[id],'LogEntries_7days'],Train.loc[[id],'Sessioncounts_7days'] = \
  ServerLogData(user_id,threshold,threshold7,dataframe)

This approach is incredibly slow. Just as in databases, can we use the apply method here, or something else that would be fast enough?

Please suggest a better approach.

Edit: Based on suggestions from super-helpful colleagues here, I am adding some data images for both dataframes and some explanation. In the Train dataframe there are user actions with some date values, and there are multiple rows for a user_id. For each row, I pass the user_id and dates to the other dataframe and calculate some values. Please note that the second dataframe also has multiple rows per user_id for different dates, so grouping them does not seem to be an option here. I pass a user_id and dates, the flow goes to the second dataframe, and it finds the rows for that user_id which also fit the dates I passed.

[data images: sample rows from the Train and LogItem dataframes]

– ChandanJha

2 Answers


Try a groupby on user_id and then pass each user's history as a dataframe; I think that will get you faster results than processing Train line by line. I have used this method on log-file data and it wasn't slow. I don't know if it is the optimal solution, but I found the results satisfying and quite easy to implement. Something like this:

group_user = LogItem.groupby('user_id')
group_train = Train.groupby('user_id')
user_ids = Train['user_id'].unique().tolist()
for x in user_ids:
    df_user = group_user.get_group(x)
    df_train = group_train.get_group(x)
    # do your thing here
    processing_function(df_user, df_train)

Write a function doing the calculation you want (I named it processing_function). I hope it helps.

EDIT: here is what your code becomes:

def ServerLogData(threshold, threshold7, df_user):
    UserData = df_user[(df_user['server_time'] < threshold) & (df_user['server_time'] > threshold7)]
    entries = len(UserData)
    Unique_Session_Count = UserData.session_id.nunique()
    return entries, Unique_Session_Count


group_user = LogItem.groupby('user_id')
group_train = Train.groupby('user_id')
user_ids = Train['user_id'].unique().tolist()
for x in user_ids:
    df_user = group_user.get_group(x)    # all LogItem rows for this user
    df_train = group_train.get_group(x)  # all Train rows for this user
    for id in df_train.index:
        threshold = df_train.loc[id, 'impression_time']
        threshold7 = df_train.loc[id, 'AdThreshold_date']
        # get_group returns a copy, so write the results back to Train itself
        Train.loc[id, 'LogEntries_7days'], Train.loc[id, 'Sessioncounts_7days'] = ServerLogData(threshold, threshold7, df_user)
– Souha Gaaloul
  • @Souha- Thanks for your help here. Please see the data and my edits in the question. – ChandanJha Aug 28 '19 at 12:25
  • @ChandanJha "Please note that the second dataframe too has multiple rows for user_id for different dates. So grouping them does not seem be an option here." I don't see why you think that is not an option. If the user_id is unique and is the same for each user, the groupby function returns a dataframe of that specific user_id including all his usage history and all the different dates: it just runs through the whole dataframe and returns rows with that specific user_id. – Souha Gaaloul Aug 28 '19 at 13:02
  • @Souha- Let me try it out and will let you know in some time. Thanks for your valuable suggestion. – ChandanJha Aug 28 '19 at 13:27
  • @Souha- I get the sense of what you are trying to recommend with groupby: this way, in the second dataframe where I am actually updating some columns for a user_id, your approach will make us touch a particular user_id only once. However, I am not able to implement it well. – ChandanJha Aug 29 '19 at 15:50
  • @Souha- I get the sense of what you are trying to recommend with groupby; however, I am not able to implement it well. Example (see the sketch after these comments):

        group_user:
        user_id  server_time  session_id
        1        2018-01-01   435
        2        2018-01-02   564
        1        2018-01-04   675
        1        2018-01-05   454
        1        2018-01-06   920

        group_train:
        user_id  impression_time  count
        1        2018-01-03       0
        1        2018-01-05       0

    Logic: impression_time > server_time. Expected output for group_train:

        user_id  impression_time  count
        1        2018-01-03       2
        1        2018-01-05       3

    – ChandanJha Aug 29 '19 at 16:00
  • @ChandanJha I don't understand your last comment, but if you follow my method the way I have specified in my code block above, you should replace `for id in Train.index:` with `for id in df_train.index:`, and in your function ServerLogData pass df_user as a parameter (delete the parameter dataframe, because it is the definition of df_user), and instead of passing user_id pass x (because it is the user id of the dataframe you passed); that should do it, I guess. – Souha Gaaloul Aug 29 '19 at 16:14
  • the user_id parameter becomes useless, so delete it, and `UserData = dataframe.loc[(dataframe['user_id']==user_id) & (dataframe['server_time']<threshold) & (dataframe['server_time']>threshold7)]` just becomes `UserData = df_user.loc[(df_user['server_time']<threshold) & (df_user['server_time']>threshold7)]` – Souha Gaaloul Aug 29 '19 at 16:19
  • That way your code becomes more readable because the logic is more obvious; as for speed, you will have to test it. – Souha Gaaloul Aug 29 '19 at 16:20
  • @ChandanJha I edited my response and added the transformed code; please check the logic, as perhaps I made some mistakes since I don't totally understand your use case. – Souha Gaaloul Aug 29 '19 at 16:31
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/198675/discussion-between-souha-and-chandanjha). – Souha Gaaloul Aug 29 '19 at 16:32
  • Thank you so much @Souha for being so patient. I joined the chatroom. – ChandanJha Aug 29 '19 at 16:54
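
For reference, a minimal merge-based sketch that reproduces the expected output from the example in these comments (one possible implementation, not necessarily what the answer above intends; it assumes the second log row also belongs to user 1, since the expected counts of 2 and 3 only work out that way):

import pandas as pd

# Sample data from the comments above (the second log row is read as
# user 1's; the expected counts only add up under that reading)
group_user = pd.DataFrame({
    'user_id':     [1, 1, 1, 1, 1],
    'server_time': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-04',
                                   '2018-01-05', '2018-01-06']),
    'session_id':  [435, 564, 675, 454, 920],
})
group_train = pd.DataFrame({
    'user_id':         [1, 1],
    'impression_time': pd.to_datetime(['2018-01-03', '2018-01-05']),
})

# Pair each Train row with all of that user's log rows, keep the pairs where
# the log entry precedes the impression, then count matches per Train row
pairs = group_train.reset_index().merge(group_user, on='user_id')
pairs = pairs[pairs['impression_time'] > pairs['server_time']]
group_train['count'] = pairs.groupby('index').size()
group_train['count'] = group_train['count'].fillna(0).astype(int)
print(group_train)
#    user_id impression_time  count
# 0        1      2018-01-03      2
# 1        1      2018-01-05      3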

If you have a really large dataframe, printing each row is going to eat up a lot of time, and it's not as if you'll be able to read through thousands of lines of output anyway.

If you have a lot of rows for each id, then you can speed things up quite a bit by processing each id only once. There's a question that discusses filtering a dataframe to unique indices. The top-rated answer, adjusted for this case, would be `unique_id_df = Train.loc[~Train.index.duplicated(keep='first')]`. That creates a dataframe with only one row for each id, taking the first row for each id, which seems to be what you're doing as well.

You can then create a dataframe by applying your function to unique_id_df. There are several ways to do this. One is to create a series with `entries_counts_series = unique_id_df.apply(ServerLogData, axis=1)` and then turn it into a dataframe with `entries_counts_df = pd.DataFrame(entries_counts_series.tolist(), index=entries_counts_series.index)`. You could also put the data into unique_id_df with `unique_id_df['LogEntries_7days'], unique_id_df['Sessioncounts_7days'] = zip(*unique_id_df.apply(ServerLogData, axis=1))`, but then you would have a bunch of extra columns to get rid of.

Once you have your data, you can merge it with your original dataframe: `Train_with_data = Train.merge(entries_counts_df, left_index=True, right_index=True)`. If you put the data into unique_id_df instead, you could do something such as `Train_with_data = Train.merge(unique_id_df[['LogEntries_7days','Sessioncounts_7days']], left_index=True, right_index=True)`.
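
Putting those pieces together, a minimal sketch of the whole pipeline (it assumes the trimmed ServerLogData shown at the end of this answer and the column names from the question; apply needs a small lambda so each row supplies the function's three arguments):

import pandas as pd

# Keep one row per index label, so each id is processed only once
unique_id_df = Train.loc[~Train.index.duplicated(keep='first')]

# Apply the lookup once per id; the lambda unpacks each row into the
# three arguments of the trimmed ServerLogData below
entries_counts_series = unique_id_df.apply(
    lambda row: ServerLogData(row['user_id'],
                              row['impression_time'],
                              row['AdThreshold_date']),
    axis=1)

# Each element is an (entries, Unique_Session_Count) tuple; expand to columns
entries_counts_df = pd.DataFrame(
    entries_counts_series.tolist(),
    index=entries_counts_series.index,
    columns=['LogEntries_7days', 'Sessioncounts_7days'])

# Merge the computed columns back onto the full Train dataframe
Train_with_data = Train.merge(entries_counts_df, left_index=True, right_index=True)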

Try out different variants of this and the other answers, and see how long each of them takes on a subset of your data.

Also, some notes on `ServerLogData`:

  1. `dataframe` is passed as a parameter, but then immediately overwritten.
  2. You subset LogItem to rows where `LogItem['user_id']==user_id`, but then you check that condition again. Unless I'm missing something, you can get rid of the `dataframe = LogItem[LogItem['user_id']==user_id]` line.
  3. You've split the line that sets UserData across several lines, which is good, but standard style is to indent the continuation lines in this sort of situation.
  4. You're only using `session_id`, so you only need to take that column of the dataframe.

So:

def ServerLogData(user_id,threshold,threshold7):
    UserData = LogItem.session_id.loc[(LogItem['user_id']==user_id) & 
                             (LogItem['server_time']<threshold) & 
                             (LogItem['server_time']>threshold7)]
    entries = len(UserData)
    Unique_Session_Count = UserData.nunique()
    return entries, Unique_Session_Count

I did some quite-possibly-not-representative tests, and subsetting the column, rather than subsetting the entire dataframe and then taking the column out of that dataframe, sped things up significantly.
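
If you want to check that on your own data, a rough timing sketch along these lines (toy data with made-up sizes; absolute numbers will vary):

import timeit
import numpy as np
import pandas as pd

# Toy stand-in for LogItem; adjust sizes to resemble the real data
rng = np.random.default_rng(0)
n = 500_000
LogItem = pd.DataFrame({
    'user_id': rng.integers(0, 1_000, n),
    'server_time': pd.Timestamp('2018-01-01')
                   + pd.to_timedelta(rng.integers(0, 60, n), unit='D'),
    'session_id': rng.integers(0, 10_000, n),
})
mask = (LogItem['user_id'] == 42) & (LogItem['server_time'] < pd.Timestamp('2018-02-01'))

# Subset the whole dataframe, then pull the column out of the result
t_frame = timeit.timeit(lambda: LogItem.loc[mask]['session_id'].nunique(), number=100)
# Subset only the column
t_col = timeit.timeit(lambda: LogItem['session_id'].loc[mask].nunique(), number=100)
print(f'whole frame: {t_frame:.3f}s, column only: {t_col:.3f}s')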

– Acccumulation
  • @Accumulation- Thanks a lot for putting in the effort to help me here. I did try your suggestion, but it did not cut down the processing time by a big margin. I added my data images and some comments at the end again. Kindly take a look. – ChandanJha Aug 28 '19 at 12:26