0

I have three different data frames with basketball players' data.

In all three dataframes there are basketball players' names. I want to join all three dataframes into one EntitySet to use automatic feature generation using featuretools.

As I understand, I need to create an integer key in 3 dataframes, which would be used to join all three dataframes. I understand that the same unique integer ids should be the same for the same players.

How can I create unique integer keys for 3 different datasets, ensuring that the same players have the same ids?

Rikki Tikki Tavi
  • 3,089
  • 5
  • 43
  • 81
  • 2
    Why not join them on the [names](https://stackoverflow.com/questions/23668427/pandas-three-way-joining-multiple-dataframes-on-columns?rq=1)? – uncertainty Jan 20 '23 at 10:04

1 Answers1

0

You do not need to create an integer key to create the relationships. If your names are unique you can simply use them directly in defining the relationships.

import pandas as pd
import featuretools as ft

players = pd.DataFrame({
    "name": ["John", "Jane", "Bill"],
    "date": pd.to_datetime(["2020-01-01", "2020-02-01" ,"2020-03-01"]),
    "other_data": [100, 200, 300]
})
scores = pd.DataFrame({
    "game_id": [0, 1, 2],
    "player": ["John", "John", "Jane"],
    "score": [24, 17, 29]
})

es = ft.EntitySet()
es.add_dataframe(dataframe_name="players", dataframe=players, index="name")
es.add_dataframe(dataframe_name="scores", dataframe=scores, index="game_id")
es.add_relationship("players", "name", "scores", "player")

If your player names are not unique, then you won't be able to create a unique integer id from the names alone. You would have to combine the name with some other piece of information (something like team) to create a new column in your dataframe that uniquely identifies the player in all of your dataframes.

Nate Parsons
  • 376
  • 1
  • 2
  • Thank you. I think the last row should be like this: es.add_relationship("players", "name", "scores", "game_id"), right? How can I combine name with some other piece of info to create a new column? – Rikki Tikki Tavi Jan 21 '23 at 07:45
  • The problem I have is that EntitySet expects one-to-many relations. I have one record with one name in one dataset, but I have many records with the same names in the second dataframe. When I try to do like this: entity_set.add_relationship("players_set", "name", "season_stats", "name") I receive an error that index must be unique. I can aggregate all rows by grouping them by name, but in this case, I won't have one-to-many relation, but one-to-one. What should I do? – Rikki Tikki Tavi Jan 21 '23 at 08:08
  • No, the last line above is how it should be. You want the primary key in one dataframe to link to the foreign key column in the other dataframe. This is defining the one-to-many relationship that would get used for aggregating the scores data back to the players table. Without seeing your specific data, it is difficult to know exactly how you should handle things in this case, but if your data is similar to that I used in my example above, you should do something very much like that. – Nate Parsons Jan 23 '23 at 16:03