1

On the last post i posted that i have a problem to remove/sum row based on certains conditions and someone helped to right a code like this.

Here is my code:

import pandas as pd
df=pd.DataFrame({
                'cars':['Kia rio','Bmw','Mercedes','Ford','Kia','Mercedes Benz'],
                'rent':[1,1,2,1,4,2],
                'sale':[2,4,1,1,5,1],
                'id':[2000,1000,3000,4000,2000,3000]
                })
print(df)
df1 = df.drop_duplicates().groupby(['id'], sort=False, as_index=False).sum()
print(df1)

But when I run the groupby method it drop car columns. Can anyone help me on this?

I got this output:

     id  rent  sale
0  2000     5     7
1  1000     1     4
2  3000     4     2
3  4000     1     1

Expected output:

    cars      rent  sale  id
    Kia       5      7    2000
    Bmw       1      4    1000
    Mercedes  2      1    3000
    Ford      1      1    4000
feelsgood
  • 135
  • 14

3 Answers3

2

You need aggregate columns for keep it in ouput (or pass to by parameter like id), so for avoid lost cars column is used aggregate function last for last values of cars per groups, also is specified aggregation sum for another 2 columns:

df1 = df.drop_duplicates().groupby('id',sort=False,as_index=False).agg(cars=('cars','last'),
                                                                       rent=('rent', 'sum'),
                                                                       sale=('sale', 'sum'))
print(df1)
     id           cars  rent  sale
0  2000            Kia     5     7
1  1000            Bmw     1     4
2  3000  Mercedes Benz     4     2
3  4000           Ford     1     1

If possible split carnames by first space and aggregate by id and cars:

df['cars'] = df['cars'].str.split().str[0]
df1 = df.drop_duplicates().groupby(['id','cars'], sort=False, as_index=False).sum()
print(df1)
     id      cars  rent  sale
0  2000       Kia     5     7
1  1000       Bmw     1     4
2  3000  Mercedes     2     1
3  4000      Ford     1     1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • @AhmedChater - there is in data `'Mercedes','Mercedes Benz'`, so different values, so `drop_duplicates()` not removed dupes. – jezrael Jun 23 '21 at 11:12
0

I believe it is happening because cars is non-numeric and this is skipped by default on DataFrame sum. Sum has an optional parameter numeric_onlybool which says:

numeric_onlybool, default None

    Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html

This question also has some good info about grouping the string parts: Pandas groupby: How to get a union of strings

0
import pandas as pd
df = pd.DataFrame({
                'cars':['Kia rio', 'Bmw', 'Mercedes', 'Ford', 'Kia', 'Mercedes Benz'],
                'rent':[1, 1, 2, 1, 4, 2],
                'sale':[2, 4, 1, 1, 5, 1],
                'id': [2000, 1000, 3000, 4000, 2000, 3000]
                })
print(df)
df1 = df.drop_duplicates().groupby(['id', 'cars'], sort=False, as_index=False).sum()
print(df1)