
I'm new to Pandas and I'd like to know what I'm doing wrong in the following example.

I found an example here explaining how to get a data frame after applying a group by instead of a series.

import numpy as np
import pandas as pd

df1 = pd.DataFrame( { 
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
"City" : ["Seattle", "Seattle", "Baires", "Caracas", "Baires", "Caracas"] })

df1['size'] = df1.groupby(['City']).transform(np.size)

df1.dtypes #Why is size an object? shouldn't it be an integer?

df1[['size']] = df1[['size']].astype(int) #convert to integer

df1['avera'] = df1.groupby(['City'])['size'].transform(np.mean) #group by again

Basically, I want to apply the same transformation to a huge data set I'm working on now, but I'm getting an error message:

budgetbid['meanpb']=budgetbid.groupby(['jobid'])['probudget'].transform(np.mean) #can't upload this data for the sake of explanation

ValueError: Length mismatch: Expected axis has 5564 elements, new values have 78421 elements

Thus, my questions are:

  1. How can I overcome this error?
  2. Why do I get an object type when applying group by with size, instead of an integer type?
  3. Let us say that I want to get a data frame from df1 with unique cities and their respective count(*). I know I can do something like

    newdf=df1.groupby(['City']).size()

Unfortunately, this is a series, but I want a data frame with two columns, City and the brand new variable, let's say countcity. How can I get a data frame from a group-by operation like the one in this example?

  4. Could you give me an example of a select distinct equivalent here in pandas?
nhern121

2 Answers


Question 2: Why does df1['size'] have dtype object?

groupby/transform returns a DataFrame with a dtype for each column which is compatible with both the original column's dtype and the result of the transformation. Since Name has dtype object,

df1.groupby(['City']).transform(np.size)

is converted to dtype object as well.

I'm not sure why transform is coded to work like this; there may be some use case that demands this behavior to ensure correctness in some sense.
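One way to sidestep the upcasting, sketched below: select the column before transforming, so the result no longer has to share a dtype with the object column Name. (This uses the string alias 'size' rather than np.size; both name the same reduction here.)

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Baires", "Caracas", "Baires", "Caracas"]})

# Transforming a single selected column keeps the numeric dtype:
# no object column is involved, so nothing is upcast.
size_col = df1.groupby('City')['Name'].transform('size')
print(size_col.dtype)  # integer, not object
```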


Questions 1 & 3: Why do I get ValueError: Length mismatch and how can I avoid it?

There are probably NaNs in the column being grouped. For example, suppose we change one of the values in City to NaN:

import numpy as np
import pandas as pd

df2 = pd.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : [np.nan, "Seattle", "Baires", "Caracas", "Baires", "Caracas"] })
grouped = df2.groupby(['City'])

then

In [86]: df2.groupby(['City']).transform(np.size)
ValueError: Length mismatch: Expected axis has 5 elements, new values have 6 elements

Groupby does not group the NaNs:

In [88]: [city for city, grp in  df2.groupby(['City'])]
Out[88]: ['Baires', 'Caracas', 'Seattle']
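On newer pandas versions (1.1+), groupby accepts a dropna parameter, which gives a direct fix: with dropna=False the NaN rows form their own group, so transform returns a value for every row and the length mismatch disappears. A sketch, assuming pandas 1.1 or later:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": [np.nan, "Seattle", "Baires", "Caracas", "Baires", "Caracas"]})

# dropna=False keeps the NaN rows as their own group, so the
# transform result has one value per input row.
df2['size'] = df2.groupby('City', dropna=False)['Name'].transform('size')
```

On older pandas versions, the groupby/agg + merge approach below is the way to go.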

To work around this, use groupby/agg:

countcity = grouped.agg('count').rename(columns={'Name':'countcity'})
#          countcity
# City              
# Baires           2
# Caracas          2
# Seattle          1

and then merge the result back into df2:

result = pd.merge(df2, countcity, left_on=['City'], right_index=True, how='outer')
print(result)

yields

      City     Name  countcity
0      NaN    Alice        NaN
1  Seattle      Bob          1
2   Baires  Mallory          2
4   Baires      Bob          2
3  Caracas  Mallory          2
5  Caracas  Mallory          2
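An equivalent shortcut, sketched here, skips the merge entirely: value_counts ignores NaN and map leaves unmatched keys as NaN, so the result matches the merge above (the NaN city gets a NaN count).

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": [np.nan, "Seattle", "Baires", "Caracas", "Baires", "Caracas"]})

# value_counts skips NaN; map looks each city up in those counts,
# leaving NaN where the city itself is NaN.
df2['countcity'] = df2['City'].map(df2['City'].value_counts())
```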

Question 4: Do you mean what is the Pandas equivalent of the SQL select distinct statement?

If so, perhaps you are looking for Series.unique or perhaps iterate through the keys in the Groupby object, as was done in

[city for city, grp in df2.groupby(['City'])]
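For a closer analogue of select distinct, drop_duplicates works on a Series or on a subset of columns; a sketch:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Baires", "Caracas", "Baires", "Caracas"]})

# SELECT DISTINCT City FROM df1
distinct_cities = df1['City'].drop_duplicates()

# SELECT DISTINCT City, Name FROM df1
distinct_pairs = df1[['City', 'Name']].drop_duplicates()
```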
unutbu

3.)

Just call pd.DataFrame() again:

newdf = pd.DataFrame(df1.City.value_counts())

or

newdf = pd.DataFrame(df1.groupby(['City']).size())

4.) I think the select distinct equivalent would just be using more than one column in your groupby. So for example,

df1.groupby(['City', 'Name']).size() would return the Series:

City     Name   
Baires   Bob        1
         Mallory    1
Caracas  Mallory    2
Seattle  Alice      1
         Bob        1
dtype: int64
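To go from that size() Series straight to a two-column DataFrame (City plus a named count column), reset_index can name the new column in one step; a sketch:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Baires", "Caracas", "Baires", "Caracas"]})

# reset_index(name=...) turns the Series index (City) into a column
# and names the values column in a single call.
newdf = df1.groupby('City').size().reset_index(name='countcity')
```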
dagrha
  • For 3, I would complement it by adding `newdf=newdf.reset_index('City')` and then `newdf.columns=['City','countcity']` – nhern121 Jul 10 '15 at 19:53