4

I have one DataFrame where different rows can have the same value for one column.
As an example:

import pandas as pd
df = pd.DataFrame( { 
    "Name" : ["Alice", "Bob", "John", "Mark", "Emma" , "Mary"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

     City       Name
0    Seattle    Alice
1    Seattle    Bob
2    Portland   John
3    Seattle    Mark
4    Seattle    Emma
5    Portland   Mary

Here, a given value for "City" (e.g. "Portland") is shared by several rows.

I want to create from this data frame several data frames that have in common the value of one column. For the example above, I want to get the following data frames:

     City       Name
0    Seattle    Alice
1    Seattle    Bob
3    Seattle    Mark
4    Seattle    Emma

and

     City       Name
2    Portland   John
5    Portland   Mary

From this answer, I am creating a mask that can be used to generate one data frame:

def mask_with_in1d(df, column, val):
    mask = np.in1d(df[column].values, [val])
    return df[mask]

# Return the last data frame above
mask_with_in1d(df, 'City', 'Portland')

The problem is to create efficiently all data frames, to which a name will be assigned. I am doing it this way:

unique_values = np.sort(df['City'].unique())
for city_value in unique_values:
    exec("df_{0} = mask_with_in1d(df, 'City', '{0}')".format(city_value))

which gives me the data frames df_Seattle and df_Portland that I can further manipulate.

Is there a better way of doing this?

2 Answers2

4

Have you got a fixed list of cities you want to do this for? Simplest solution is to group by city and can then loop over the groups

for city, names in df.groupby("City"):
    print(city)
    print(names)

Portland
       City  Name
2  Portland  John
5  Portland  Mary
Seattle
      City   Name
0  Seattle  Alice
1  Seattle    Bob
3  Seattle   Mark
4  Seattle   Emma

Could then assign to a dictionary or some such (df_city[city] = names) if you wanted df_city["Portland"] to work. Depends what you want to do with the groups once split.

Ken Syme
  • 3,532
  • 2
  • 17
  • 19
  • Thanks, definitely cleaner. I would do this for all unique city values in the data frame. Creating a dictionary will be the best approach for my use case. –  Nov 20 '17 at 13:17
2

You can use groupby for this:

dfs = [gb[1] for gb in df.groupby('City')]

This will construct a list of dataframes, one per value of the 'City' column.

In case you want tuples with the value of the dataframe, you can use:

dfs = list(df.groupby('City'))

Note that assigning by name is usually an anti-pattern. And exec and eval are definitely antipatterns.

Willem Van Onsem
  • 443,496
  • 30
  • 428
  • 555
  • Thanks, glad to know about these alternatives in terms of the output data structure. I learned about `exec` from another code that is not mine I was working on. Good to know this should be avoided in general. (Just found this anti-pattern [little book](https://docs.quantifiedcode.com/python-anti-patterns/) which seems worth to go through.) –  Nov 20 '17 at 13:20