
I am trying to group the groups that share a common element in a Pandas DataFrame. My DataFrame looks like this (I added the empty lines just for extra legibility):

ID       REG
15      01497
15      01493

19      01706
19      01706-A

78      05710
78      01738

143     01626
143     01634

144     01626
144     01644

207     05255
207     01638

209     05255
209     03143

I expect to create a new column where I can group all the IDs that share the same REG number, so it would look like this:

ID     REG          GROUP
15     01497        1
15     01493        1

19     01706        2
19     01706-A      2

78     05710        3
78     01738        3

143    01626        4
143    01634        4

144    01626        4
144    01644        4

207    05255        5
207    01638        5

209    05255        5
209    03143        5

There are some questions that address similar problems (e.g. Pandas DataFrame Groupby two columns and get counts, How do I find common values within groups in pandas?, or Numbering Groups In Pandas DataFrame), but they are not quite the same: I am not aiming to perform a sum or count in the traditional way.
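For reference, the sample above can be rebuilt like this (REG is kept as strings so the leading zeros survive):

```python
import pandas as pd

# Rebuild the sample data from the question; REG stays a string so
# leading zeros (e.g. "01497") are preserved.
df = pd.DataFrame({
    "ID": [15, 15, 19, 19, 78, 78, 143, 143, 144, 144, 207, 207, 209, 209],
    "REG": ["01497", "01493", "01706", "01706-A", "05710", "01738",
            "01626", "01634", "01626", "01644", "05255", "01638",
            "05255", "03143"],
})
print(df)
```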

endive1783
Bonjorn
  • Is this really a well-defined problem? There could be a group that shares REG numbers with different groups that themselves don't share a REG number. – Timus Jan 23 '23 at 12:24
  • Well, to get this DataFrame I previously removed the duplicates, subsetting by ID and REG, so all the rows should be unique in that respect – Bonjorn Jan 23 '23 at 19:10

3 Answers


One way is to use pandas.Series.duplicated with pandas.Series.cumsum:

df["GROUP"] = (~df["ID"].duplicated() ^ df["REG"].duplicated()).cumsum()

Output :

print(df)

     ID      REG  GROUP
0    15    01497      1
1    15    01493      1
2    19    01706      2
3    19  01706-A      2
4    78    05710      3
5    78    01738      3
6   143    01626      4
7   143    01634      4
8   144    01626      4
9   144    01644      4
10  207    05255      5
11  207    01638      5
12  209    05255      5
13  209    03143      5
Timeless
  • Unfortunately this doesn't seem to work very well, I am getting a result like this: [link](https://ibb.co/k23rstt) I will keep checking the code in case I missed something though – Bonjorn Jan 23 '23 at 22:35
  • To update my previous comment: this solution actually works. I was checking everything again, and my mistake was that I had groups with common values separated by several rows; the cumulative sum kind of stops 'counting' when the groups with common REG values are far away from one another. Still, it is a good suggestion for a specific case like the one mentioned in the question – Bonjorn Feb 02 '23 at 17:01
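For the more general case raised in this comment (shared REG values in rows that are far apart), one sketch, not part of this answer, is to treat the task as finding connected components: IDs are linked whenever they share a REG, using a small union-find. The sample below deliberately puts the two linked IDs (143 and 144) on opposite sides of an unrelated ID:

```python
import pandas as pd

# Rows where the IDs sharing REG "01626" (143 and 144) are NOT adjacent.
df = pd.DataFrame({
    "ID":  [143, 143, 15, 15, 144, 144],
    "REG": ["01626", "01634", "01497", "01493", "01626", "01644"],
})

# Union-find over IDs: two IDs end up in the same component
# whenever they share at least one REG value.
parent = {i: i for i in df["ID"].unique()}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Link every pair of IDs that appears under the same REG.
for _, ids in df.groupby("REG")["ID"]:
    first = ids.iloc[0]
    for other in ids.iloc[1:]:
        union(first, other)

# Number the components in order of first appearance, starting at 1.
df["GROUP"] = pd.factorize(df["ID"].map(find))[0] + 1
print(df)
```

Here 143 and 144 land in the same group even though the rows of ID 15 sit between them, which is exactly the case where a cumulative sum over adjacent rows falls short.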

With df your dataframe you could try the following:

df["GROUP"] = (
    df.groupby("ID", sort=False)["REG"].transform("first").to_frame()
    .groupby("REG", sort=False).ngroup().add(1)
)
  • First group by column ID and pick the respective first REG item for each ID group.
  • Then map the REG column built this way onto group numbers.
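The two steps can be inspected separately; a small reduced version of the sample shows the intermediate representative-REG column that ngroup then numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "ID":  [15, 15, 19, 19, 143, 143, 144, 144],
    "REG": ["01497", "01493", "01706", "01706-A",
            "01626", "01634", "01626", "01644"],
})

# Step 1: replace each row's REG by the first REG of its ID group.
rep = df.groupby("ID", sort=False)["REG"].transform("first")
print(rep.tolist())
# ['01497', '01497', '01706', '01706', '01626', '01626', '01626', '01626']

# Step 2: number the distinct representatives in order of appearance,
# starting at 1.
df["GROUP"] = rep.to_frame().groupby("REG", sort=False).ngroup().add(1)
print(df["GROUP"].tolist())
# [1, 1, 2, 2, 3, 3, 3, 3]
```

Note that ID 144 joins the group of ID 143 here because its first REG ("01626") matches, which is also why the approach depends on the shared REG appearing in the first position of each ID group.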

Result for the sample in the question:

     ID      REG  GROUP
0    15    01497      1
1    15    01493      1
2    19    01706      2
3    19  01706-A      2
4    78    05710      3
5    78    01738      3
6   143    01626      4
7   143    01634      4
8   144    01626      4
9   144    01644      4
10  207    05255      5
11  207    01638      5
12  209    05255      5
13  209    03143      5
Timus
  • Continuing my reply under the solution provided by Timeless: this solution works even if the ID groups with common values are separated by several rows. However, it works only if the common REG number is in the first position within the group; in my original DataFrame this is not always the case, as there are also some cases where the common REG number is in the last position, and it doesn't seem possible to place all these common values in the same position. Still, I am marking this answer correct, as it has been the most useful so far and I didn't specify the sorting problem in the original question – Bonjorn Feb 02 '23 at 17:19

Assuming you have a dataframe df with ID as your index, try:

index_lookup = df.index.unique()
# x.name is the row's index label when applying with axis=1
df['GROUP'] = df.apply(lambda x: index_lookup.get_loc(x.name), axis=1)
print(df)

If ID is not the index, you can either make it the index using:

df = df.set_index('ID')

or if you do not wish to change the index use:

index_lookup = list(df['ID'].unique())
df['GROUP'] = df.apply(lambda x: index_lookup.index(x['ID']), axis=1)

That should work.

Galo do Leste