0

I have a raw dataframe that looks like

        codcet  placa_encoded   date   time_seconds velocidade
5031 490191222  431.0      2021-03-11   70079.0      51
5032 490221211  431.0      2021-03-11   72219.0      55
7991 490361213  562.0      2021-03-11   28559.0      24
7992 490361232  562.0      2021-03-11   29102.0      29              
7993 490361221  562.0      2021-03-11   30183.0      33                 
...

Where the numbers on the far left are indexes from the original dataset. My goal is to convert this into a dataframe indexed by placa_encoded and by n, a counter within each group that then looks like

placa_encoded  n    time_seconds  velocidade            codcet    
431.0          0      70079.0           51              490191222   
431.0          1      72219.0           55              490221211   
562.0          0      28559.0           24              490361213   
562.0          1      29102.0           29              490361232
562.0          2      30183.0           33              490361221   

That is, I aim to groupby('placa_encoded') then add another column n that counts the position within each group. The row should be indexed by both placa_encoded and n. I think I can use cumcount() to do this but it's unclear to me how to add it as a column since groupby doesn't product a dataframe I can assign to. I looked at this question but it seems they use .count() to convert it to a dataframe, and I want to preserve the data instead of getting any counts. I also tried to use pd.DataFrame(gbplaca) and pd.DataFrame(gbplaca.groups) to no avail.

Thank you so much!

Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52
Tanishq Kumar
  • 263
  • 1
  • 13

2 Answers2

1

I think this is what you are wanting

df['n'] = df.sort_values(['time_seconds'], ascending=True).groupby(['placa_encoded']).cumcount()
df = df.set_index(['placa_encoded', 'n'])
df

This is a multi-index with 'placa_encoded' and 'n'

ArchAngelPwn
  • 2,891
  • 1
  • 4
  • 17
  • I updated my answer to make it a multi-index. Is that what you were looking for? – ArchAngelPwn Jun 10 '22 at 16:13
  • Yes, thanks! In general, should we always operate on the original dataframe and use `groupby.agg()` type methods to create new columns we can add to the original dataframe? What if I had wanted to make a dataframe where the rows were grouped by/clumped by, say, `velocidade`, instead? – Tanishq Kumar Jun 10 '22 at 16:14
  • Its difficult to say as that is kind of a general question, every situation is different so its outside the bounds of a SO comment (not trying to be evasive just a kind of big question) in this situation though if you wanted to do `velocidade` instead you would just replace the `placa_encoded'` from the code above...does that answer your question (or at least sort of) – ArchAngelPwn Jun 10 '22 at 16:17
  • Well, the original dataframe already has rows grouped by `placa_encoded`, but not by `velocidad`. That's what I'm asking -- how would we have done the same if the dataframe wasn't grouped by `placa_encoded`. We can't just run `groupby('placa_encoded')` then the same commands, because `groupby` doesn't return a dataframe we can run `set_index` on, but instead a `DataFrameGroupBy` object. – Tanishq Kumar Jun 10 '22 at 16:24
  • Oh so you got a dataframe doing a groupby on `velocidad` then from that dataframe you did another group by using `placa_encoded`? – ArchAngelPwn Jun 10 '22 at 16:29
0

All you have to do is to set the column in df (named 'n') with the result from df.groupby('placa_encoded').cumcount():

    df.loc[:, 'n'] = df.groupby('placa_encoded').cumcount()
    print(df)
           codcet  placa_encoded        date  time_seconds  velocidade  n
     0  490191222          431.0  2021-03-11       70079.0          51  0
     1  490221211          431.0  2021-03-11       72219.0          55  1
     2  490361213          562.0  2021-03-11       28559.0          24  0
     3  490361232          562.0  2021-03-11       29102.0          29  1
     4  490361221          562.0  2021-03-11       30183.0          33  2

Then, you can set the multi-index of df to ['placa_encoded', 'n'] if you want:

    df = df.set_index(['placa_encoded', 'index'])
    print(df)
                         codcet        date  time_seconds  velocidade
     placa_encoded n                                                 
     431.0         0  490191222  2021-03-11       70079.0          51
                   1  490221211  2021-03-11       72219.0          55
     562.0         0  490361213  2021-03-11       28559.0          24
                   1  490361232  2021-03-11       29102.0          29
                   2  490361221  2021-03-11       30183.0          33
aichao
  • 7,375
  • 3
  • 16
  • 18