0

I have a dataset with unique names. Another dataset contains several rows with the same names as in the first dataset.

I want to create a column with unique ids in the first dataset and another column in the second dataset with the same ids corresponding to all the same names in the first dataset.

For example:

Dataframe 1:

   player_id Name 
    1        John Dosh
    2        Michael Deesh
    3        Julia Roberts

Dataframe 2:

player_id Name
1         John Dosh
1         John Dosh  
2         Michael Deesh
2         Michael Deesh
2         Michael Deesh
3         Julia Roberts
3         Julia Roberts

I want to do to use both data frames to run deep feature synthesis using featuretools. To be able to do something like this:

entity_set = ft.EntitySet("basketball_players")

entity_set.add_dataframe(dataframe_name="players_set",
                 dataframe=players_set,
                 index='name'
                 )

entity_set.add_dataframe(dataframe_name="season_stats",
                 dataframe=season_stats,
                 index='season_stats_id'
                 )


entity_set.add_relationship("players_set", "player_id", "season_stats", "player_id")
Rikki Tikki Tavi
  • 3,089
  • 5
  • 43
  • 81

1 Answers1

0

This should do what your question asks:

import pandas as pd
df1 = pd.DataFrame([
'John Dosh',
'Michael Deesh',
'Julia Roberts'], columns=['Name'])

df2 = pd.DataFrame([
['John Dosh'],
['John Dosh'],
['Michael Deesh'],
['Michael Deesh'],
['Michael Deesh'],
['Julia Roberts'],
['Julia Roberts']], columns=['Name'])

print('inputs:', '\n')
print(df1)
print(df2)

df1 = df1.reset_index().rename(columns={'index':'id'}).assign(id=df1.index + 1)
df2 = df2.join(df1.set_index('Name'), on='Name')[['id'] + list(df2.columns)]

print('\noutputs:', '\n')
print(df1)
print(df2)

Input/output:

inputs:

            Name
0      John Dosh
1  Michael Deesh
2  Julia Roberts
            Name
0      John Dosh
1      John Dosh
2  Michael Deesh
3  Michael Deesh
4  Michael Deesh
5  Julia Roberts
6  Julia Roberts

outputs:

   id           Name
0   1      John Dosh
1   2  Michael Deesh
2   3  Julia Roberts
   id           Name
0   1      John Dosh
1   1      John Dosh
2   2  Michael Deesh
3   2  Michael Deesh
4   2  Michael Deesh
5   3  Julia Roberts
6   3  Julia Roberts

UPDATE:

An alternative solution which should give the same result is:

df1 = df1.assign(id=list(range(1, len(df1) + 1)))[['id'] + list(df1.columns)]
df2 = df2.merge(df1)[['id'] + list(df2.columns)]
constantstranger
  • 9,176
  • 2
  • 5
  • 19
  • It gives the following error: ValueError: columns overlap but no suffix specified: Index(['Unnamed: 0'], dtype='object') – Rikki Tikki Tavi Jan 21 '23 at 13:02
  • Not seeing that error from `join()` in my environment. Try replacing the line using `join()` with this: `df2 = df2.merge(df1, on='Name')[['id'] + list(df2.columns)]`. – constantstranger Jan 21 '23 at 13:07
  • will I receive a merged dataframe? that's not what I want to do - I still want the dataframes to remain separate. – Rikki Tikki Tavi Jan 21 '23 at 13:09
  • The use of merge is just a way to create a new `id` column in `df2` that has the appropriate values as indicated in `df1`. As you can see from the sample output, the two dataframes have each had an `id` column added, and are otherwise unchanged. – constantstranger Jan 21 '23 at 13:11
  • I receive such an error: "['Unnamed: 0'] not in index" I believe that "['Unnamed: 0'] is a current index in df2. Should I do reset_index? – Rikki Tikki Tavi Jan 21 '23 at 13:12
  • Is that what you see when running the exact code and data in my answer? Which pandas version are you using? – constantstranger Jan 21 '23 at 13:13
  • I have updated my answer to add alternative code, in case there is something related to pandas version causing the logic for adding `id` to `df1` to work differently in your environment. – constantstranger Jan 21 '23 at 13:23