3

As I create new data frames for each customer I'd like to also create one giant data frame of all of them appended together.

I've created a function to group user data how I need it. Now I want to iterate over another data frame containing unique user keys and use those user keys to create data frames for each user. I'd then like to aggregate all those data frames into one giant data frame.

for index, row in unique_users.iterrows():
    customer = user_df(int(index))
    print(customer)

This function works as intended and prints a df for each customer

for index, row in unique_users.iterrows():
    top_users = pd.DataFrame()
    customer = user_df(int(index))
    top_users = top_users.append(customer)
print(top_users)

This only prints out the last customer's df

I expect that as it iterates and creates a new customer df it will append that to the top_user df so at the end I have one giant top_user df. But instead it only contains that last customer's df.

asked by amanda (edited by Trenton McKinney)
  • 2
    you re-declare `top_users` inside your for loop. set `top_users = pd.DataFrame()` before your loop and it should perform as you expect – wpercy May 28 '19 at 23:16
  • 2
    that being said, I doubt that you should be using `.iterrows()` to perform this aggregation, but it's impossible to tell without seeing the full code – wpercy May 28 '19 at 23:17
  • I second the suggestion that likely, what you are doing can be accomplished without `.iterrows`. If you describe your situation more fully, some pandas wiz can probably guide you to the "pandas way" of doing things - pandonic you might say. You should consider things like `.iterrows` and `.itertuples` as last resorts. – juanpa.arrivillaga May 28 '19 at 23:23
  • Thanks that worked – amanda May 28 '19 at 23:24
  • I have two dataframes. One has the keys of unique users. The other has all of the event data from all users. I want to iterate over the unique_users dataframe and for each key then pull out all of the data related to that key and store it in a dataframe just for that user. That's what the user_df() function is doing. Open to whatever is the easiest way to do that! – amanda May 28 '19 at 23:27
  • @amanda it sounds like you want to `merge` (and then groupby user_id) – Andy Hayden May 28 '19 at 23:34
  • 2
    Hi amanda! If you could edit your post to include a faked up couple of input data frames, and a faked up list of what-you-want data frames it would really help. (`df1 = pd.DataFrame(....)\n df2 = pd.DataFr...`, and so on. I do *strongly* suspect that you don't want a DataFrame for *each user*, fwiw. Cheers! – Mike May 28 '19 at 23:39
  • use `top_users = pd.DataFrame()` before `for` – furas May 29 '19 at 00:57
  • What's the difference between your unique_users df and your customer df ? – vlemaistre May 29 '19 at 07:04

2 Answers

2

As advised by @unutbu: never call `DataFrame.append` or `pd.concat` inside a for-loop; it leads to quadratic copying. Instead, build a list of data frames inside the loop and call `pd.concat` once, outside it.

You can also build those data frames with a list or dictionary comprehension, skipping `iterrows` entirely and using the index values directly. Either way, you avoid the bookkeeping of initializing a container and assigning to it iteratively.

# LIST COMPREHENSION APPROACH
df_list = [user_df(int(idx)) for idx in unique_users.index.values]
top_users = pd.concat(df_list, ignore_index=True)

# DICTIONARY COMPREHENSION APPROACH
# (omit ignore_index so each user's key is kept as an outer index level)
df_dict = {idx: user_df(int(idx)) for idx in unique_users.index.values}
top_users = pd.concat(df_dict)
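As the comments suggest, the per-user loop may be avoidable entirely. A minimal sketch of that idea, assuming a hypothetical `events` frame with a `user_id` column whose values match the keys in `unique_users` (all names and data here are made up):

```python
import pandas as pd

# Hypothetical stand-ins for the real data: an events table with a
# 'user_id' column, and unique_users indexed by those same keys.
events = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'value':   [10, 20, 30, 40],
})
unique_users = pd.DataFrame(index=[1, 2])

# One vectorized filter pulls every selected user's rows at once,
# with no per-user loop and no repeated concatenation.
top_users = events[events['user_id'].isin(unique_users.index)]
```

From there, per-user aggregation is a single `groupby` on `user_id` rather than one data frame per customer.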
Parfait
1

This is what I do:

_list = []
for index, row in unique_users.iterrows():
    r = row.to_dict()   # convert the row to a dictionary
    _list.append(r)     # append the dictionary to the list

top_users = pd.DataFrame(_list)  # build one DataFrame from the list of dictionaries
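A runnable sketch of that row-to-dictionary pattern, using a made-up sample frame in place of the real `unique_users`:

```python
import pandas as pd

# Made-up stand-in for the real unique_users frame
unique_users = pd.DataFrame({'name': ['a', 'b'], 'score': [5, 7]}, index=[1, 2])

_list = []
for index, row in unique_users.iterrows():
    _list.append(row.to_dict())   # each row becomes one dictionary

top_users = pd.DataFrame(_list)   # one DataFrame from the list of dictionaries
```

Note that building the frame from plain dictionaries discards the original index; `top_users` gets a fresh 0-based index.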
Varun Kumar