2

I'm struggling to figure this out without writing some hideous loop to go over and check the groups after the groupby and merge them together. I feel like their has to be a way to do this that I just don't know.

What I'm trying to do is group a dataframe by two of its columns, but sometimes there will be combinations of those groups where the keys are flipped (i.e. [key1, key2] will have a group and [key2, key1] will have a group. I actually want to merge the groups from these combinations into a single group.

It is possible to do in a loop after the fact. I have also tried using some methods like this:

unique combinations of values in selected columns in pandas data frame and count

but could not get it to work.

here is a sample of my df:

            Ves-1 type          Ves-2 type    Duration
0                cargo                 tug  898.559993
1     fishing_trawling                 tug  898.559992
2   fishing_transiting                 tug  898.559993
3   fishing_transiting                 tug  898.559993
4                  tug                 tug  898.559992
5                cargo                 tug  898.560002
6                cargo                 tug  898.560002
7            passenger                 tug  907.200008
8             pleasure                 tug  898.560003
9                cargo                 tug  898.559993
10               cargo                 tug  898.559993
11               cargo  fishing_transiting  898.560002
12               cargo  fishing_transiting  898.559993
13               cargo  fishing_transiting  898.560002
14                 tug  fishing_transiting  898.560003
15               cargo  fishing_transiting  907.200008
16               cargo  fishing_transiting  907.200008
17                 tug  fishing_transiting  898.560002
18               cargo  fishing_transiting  898.560002
19  fishing_transiting  fishing_transiting  898.559993

if I just do a simple groupby using both Ves columns:

>>> test.groupby(['Ves-1 type','Ves-2 type'])['Duration'].agg(list)
Ves-1 type          Ves-2 type
cargo               fishing_transiting    [898.560002, 898.5599930000001, 898.560002, 90...
                    tug                   [898.5599930000001, 898.560002, 898.560002, 89...
fishing_transiting  fishing_transiting                                  [898.5599930000001]
                    tug                              [898.5599930000001, 898.5599930000001]
fishing_trawling    tug                                                 [898.5599920000001]
passenger           tug                                                        [907.200008]
pleasure            tug                                                        [898.560003]
tug                 fishing_transiting                             [898.560003, 898.560002]
                    tug                                                 [898.5599920000001]

the problem is now I have a fishing_transiting/tug combo and a tug/fishing_transiting combo... is there any way to have these groupes merged together?

EDIT - another workaround I tried that works, but would like to know if there's a way to just handle this in groupby:

>>> test['key'] = list(zip(test['Ves-1 type'].values, test['Ves-2 type'].values))
>>> test['key'] = test['key'].apply(sorted).astype(str)
>>> test.groupby('key')['Duration'].agg(list)
key
['cargo', 'fishing_transiting']                 [898.560002, 898.5599930000001, 898.560002, 90...
['cargo', 'tug']                                [898.5599930000001, 898.560002, 898.560002, 89...
['fishing_transiting', 'fishing_transiting']                                  [898.5599930000001]
['fishing_transiting', 'tug']                   [898.5599930000001, 898.5599930000001, 898.560...
['fishing_trawling', 'tug']                                                   [898.5599920000001]
['passenger', 'tug']                                                                 [907.200008]
['pleasure', 'tug']                                                                  [898.560003]
['tug', 'tug']                                                                [898.5599920000001]
Derek Eden
  • 4,403
  • 3
  • 18
  • 31

1 Answers1

3

Let's sort the values in the columns Ves-1 type and Ves-2 type along axis=1 then groupby the dataframe on these sorted columns and agg Duration using list:

c = ['Ves-1 type', 'Ves-2 type']
df.groupby(np.sort(df[c], axis=1).T.tolist())['Duration'].agg(list)

cargo               fishing_transiting    [898.5600019999999, 898.559993, 898.5600019999...
                    tug                   [898.559993, 898.5600019999999, 898.5600019999...
fishing_transiting  fishing_transiting                                         [898.559993]
                    tug                   [898.559993, 898.559993, 898.5600029999999, 89...
fishing_trawling    tug                                                        [898.559992]
passenger           tug                                                        [907.200008]
pleasure            tug                                                 [898.5600029999999]
tug                 tug                                                        [898.559992]
Name: Duration, dtype: object
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53