5

I'm trying to create a new column in the dataframe called volume. The DF already consists of other columns like market. What I want to do is to group by price and company and then get their count and add it in a new column called volume. Here's what I have:

df['volume'] = df.groupby(['price', 'company']).transform('count')

This does create a new column, however, it's giving me all the rows. I don't need all the rows. For example, before the transformation I would get 4 rows and after the transformation I still get 4 rows but with a new column.

market  company   price    volume
LA      EK        206.0     2
LA      SQ        206.0     1
LA      EK        206.0     2
LA      EK        36.0      3
LA      EK        36.0      3
LA      SQ        36.0      1
LA      EK        36.0      3

I'd like to drop the duplicated rows. Is there a query that I can do with groupby that will only show the rows like so:

market  company   price    volume
LA      EK        206.0     2
LA      SQ        206.0     1
LA      SQ        36.0      1
LA      EK        36.0      3
sacuL
  • 49,704
  • 8
  • 81
  • 106
user2896120
  • 3,180
  • 4
  • 40
  • 100

2 Answers2

9

Simply drop_duplicates with the columns ['market', 'company', 'price']:

>>> df.drop_duplicates(['market', 'company', 'price'])
  market company  price  volume
0     LA      EK  206.0       2
1     LA      SQ  206.0       1
3     LA      EK   36.0       3
5     LA      SQ   36.0       1
sacuL
  • 49,704
  • 8
  • 81
  • 106
1

Your data contains duplicates, probably because you are only including a subset of the columns. You need something else in your data other than price (e.g. two different days could close at the same price, but you wouldn't aggregate the volume from the two).

Assuming that the price is unique for a given timestamp, market and company and you first sort on your timestamp column if any (not required if there is only one price per company and market):

df = pd.DataFrame({
    'company': ['EK', 'SQ', 'EK', 'EK', 'EK', 'SQ', 'EK'],
    'date': ['2018-08-13'] * 3 + ['2018-08-14'] * 4,
    'market': ['LA'] * 7,
    'price': [206] * 3 + [36] * 4})

>>> (df.groupby(['market', 'date', 'company'])['price']
     .agg({'price': 'last', 'volume': 'count'}[['price', 'volume']]
     .reset_index()

  market        date company  price  volume
0     LA  2018-08-13      EK    206       2
1     LA  2018-08-13      SQ    206       1
2     LA  2018-08-14      EK     36       3
3     LA  2018-08-14      SQ     36       1
Alexander
  • 105,104
  • 32
  • 201
  • 196