2

I am working with a big dataset (2M+ rows) that looks like the following:

Id  TeamId  UserId
43  504     722
44  504     727
45  601     300
46  602     722
47  602     727
48  605     300
49  777     300
50  777     301
51  788     400
52  789     400
53  100     727

In this case, TeamId 504 and 602 are the same, 601 matches with 605 but not with 777 (because it has one more person in the team).

My goal is to generate unique IDs for each "unique" team:

Id  TeamId  UserId  UniqueId
43  504     722     0
44  504     727     0
45  601     300     1
46  602     722     0
47  602     727     0
48  605     300     1
49  777     300     2
50  777     301     2
51  788     400     3
52  789     400     3
53  100     727     4

A person can be in a team of 1, like in the case of UserId 727: he's part of team 504 (with UserId 722) and of team 100 (alone). This should generate 2 different unique ids for the two teams.

I cannot groupBy by TeamId only as it will detect TeamId 504 and 602 as different teams, nor I can by UserId because it will not keep track of the teams.

From my understanding, this might be a network problem. I have found a similar query to this here: Groupby two column values and create a unique id

How can I achieve this? Any help would be appreciated.

3 Answers3

0

For each row create a new variable (maybe a tuple) that have the members of that team.

Id  TeamId  UserId  NewVar
43  504     722     (722, 727)
44  504     727     (722, 727)
45  601     300     (300)
46  602     722     (722, 727)
47  602     727     (722, 727)
48  605     300     (300)
49  777     300     (300, 301)
50  777     301     (300, 301)
51  788     400     (400)
52  789     400     (400)
53  100     727     (727)

after this step compare the NewVar and assign the id Ps: don't forget to order the NewVar

Jose Macedo
  • 308
  • 1
  • 11
0

You can use pivot_table to get in index TeamId and in columns UserId, each row showing which users are in each team, such as:

dfp = df.pivot_table( values='Id', index='TeamId', columns='UserId', 
                      aggfunc=np.any, fill_value=False)
print (dfp)                            
UserId    300    301    400    722    727
TeamId                                   
100     False  False  False  False   True
504     False  False  False   True   True
601      True  False  False  False  False
602     False  False  False   True   True
605      True  False  False  False  False
777      True   True  False  False  False
788     False  False   True  False  False
789     False  False   True  False  False

Then to be able to get the UniqueId, you can sort_values by all columns, use the diff between two rows, find if any per rows meaning of different groups and cumsum such as:

print (dfp.sort_values(dfp.columns.tolist()).diff().any(1).cumsum())
TeamId
100    0
504    1 #same number for 504 and 602 but not 100 as you want
602    1
788    2
789    2
601    3
605    3
777    4
dtype: int64

so to get the new column, you can use map:

df['UniqueId'] = df.TeamId.map(dfp.sort_values(dfp.columns.tolist())
                                  .diff().abs().any(1).cumsum())
print (df)
    Id  TeamId  UserId  UniqueId
0   43     504     722         1
1   44     504     727         1
2   45     601     300         3
3   46     602     722         1
4   47     602     727         1
5   48     605     300         3
6   49     777     300         4
7   50     777     301         4
8   51     788     400         2
9   52     789     400         2
10  53     100     727         0
Ben.T
  • 29,160
  • 6
  • 32
  • 54
0

Use 2 groupby to have the result:

import pandas as pd

df = pd.DataFrame( {'Id'    :[43,44,45,46,47,48,49,50,51,52,53],
                    'TeamId':[504,504,601,602,602,605,777,777,788,789,100],
                    'UserId':[722,727,300,722,727,300,300,301,400,400,727]})

df_grouped = df.groupby('TeamId')['UserId'].apply(tuple).to_frame().reset_index()

df_grouped = df_grouped.groupby('UserId')['TeamId'].apply(tuple).to_frame().reset_index()

print(df_grouped)

result:

       UserId      TeamId
0      (300,)  (601, 605)
1  (300, 301)      (777,)
2      (400,)  (788, 789)
3  (722, 727)  (504, 602)
4      (727,)      (100,)

just iterate the TeamId column to set the team number...

Frenchy
  • 16,386
  • 3
  • 16
  • 39