1

I have a tropical cyclone dataset like this: TC dataset

I want to group the dataset according to the column 'TC_name', i.e. one TC one group, but 'TC_name' may be identical for different years and the same TC may travel interannually, so groupby() maybe doesn't work.

I also tried to loop line by line and find where the 'TC_name' changes, cut the dataset there to get a dataframe for each TC. However, this gives me over 1000 dataframes and I don't know how to represent this 2D information in a 3D formation.

I guess multi index might work through assigning a high level index to each TC, but I cannot reach it after trying many times.

An example is provided below for clarification: (I need seperate 3 tropical cyclones)

import pandas as pd
data = {'TC_name':['RITA','RITA','IDA','IDA','IDA','IDA','RITA','RITA','RITA'],
        'Year':[2020,2020,2020,2020,2020,2021,2021,2021,2021]}
df = pd.DataFrame(data)
df
    TC_name Year
0   RITA    2020
1   RITA    2020
2   IDA     2020
3   IDA     2020
4   IDA     2020
5   IDA     2021
6   RITA    2021
7   RITA    2021
8   RITA    2021

And it would be better to get this format, or any other efficient 3D data structure representation.

         TC_name    Year
   1   1    RITA    2020
       2    RITA    2020
   2   1    IDA     2020
       2    IDA     2020
       3    IDA     2020
       4    IDA     2021
   3   1    RITA    2021
       2    RITA    2021
       3    RITA    2021

Thanks a lot for your help!

Feng Hu
  • 63
  • 6
  • Why does `RITA (2020)` get number 1 assigned and `IDA (2020)` number 2 and not vice-versa? Is there some kind of implied order in the dataset? – notiv Sep 22 '22 at 17:41
  • The data is initially sorted by the time of occurrence. Actually, the order is not important. I just give a possible output by which I can access one TC track quickly. – Feng Hu Sep 22 '22 at 17:53

1 Answers1

1

Edit

Here another possibility. We reset the index to keep the original sorting:

df = df.reset_index()

Then we assign a new id within each TC_name group:

df['id2'] = df.groupby('TC_name')['index'].transform(lambda x: pd.factorize(x)[0] + 1)

Here the intermediate output:

index TC_name Year id2
0 0 RITA 2020 1
1 1 RITA 2020 2
2 2 IDA 2020 1
3 3 IDA 2020 2
4 4 IDA 2020 3
5 5 IDA 2021 4

Then we assign a new id every time the TC_name changes:

df['id'] = (df["TC_name"] != df["TC_name"].shift(1)).cumsum()

And finally we drop the index and set the multiindex:

df = df.drop('index', axis=1).set_index(['id', 'id2'])

Here the final output, which looks like yours:

TC_name Year
1 1 RITA 2020
2 RITA 2020
2 1 IDA 2020
2 IDA 2020
3 IDA 2020
4 IDA 2021
3 3 RITA 2021
4 RITA 2021
5 RITA 2021
notiv
  • 441
  • 5
  • 12
  • Thank you very much for your answer. However, please also note that there are `IDA(2020)` and `IDA(2021)`, so `groupby` is not applicable for this issue. Any other insights on how to split this dataset are also appreciated. – Feng Hu Sep 22 '22 at 18:09
  • In fact, I tried `groupby()` before asking. Adding `sort = False` in `groupby` will give the output in the original sort, which is almost there. – Feng Hu Sep 22 '22 at 18:16
  • I edited the answer to provide another way. Do you need a separate sorting for the second level of the index (reset the "counter")? – notiv Sep 22 '22 at 18:22
  • Thanks, but there are actually only 3 groups. Although `IDA` begins in 2020 and ends in 2021, it is still 'one' cyclone. – Feng Hu Sep 22 '22 at 18:33
  • I think I have it (finally) :) – notiv Sep 22 '22 at 19:06
  • 1
    This is the first time I use Stackoverflow and I am just a beginner in python programming. Really cannot believe I find a fantastic platform to support my programming study. Thanks again. Have a nice day! :) – Feng Hu Sep 22 '22 at 19:33