My data contains the names of persons and a list of cities they lived in. I want to group them together following these conditions:
-
first_name
andlast_name
are identical
-
- or (if 1. doesn't hold) their
last_name
are the same and they have lived in at least one identicalcity
.
- or (if 1. doesn't hold) their
The result should be a new column indicating the group id that each person belongs to.
The DataFrame df
looks like this:
>>> df
person_id last_name first_name cities
0 112 Dorsey Nancy [Moscow, New York]
1 113 Harper Max [Munich, Paris, Shanghai]
2 114 Mueller Max [New York, Los Angeles]
3 115 Dorsey Nancy [New York, Miami]
4 116 Harper Maxwell [Munich, Miami]
The new dataframe df_id
should look like this. The order of id
is irrelevant (i.e., which group gets id=1
), but only observations that fulfill either condition 1 or 2 should get the same id
.
>>> df_id
person_id last_name first_name cities id
0 112 Dorsey Nancy [Moscow, New York] 1
1 113 Harper Max [Munich, Paris, Shanghai] 2
2 114 Mueller Max [New York, Los Angeles] 3
3 115 Dorsey Nancy [New York, Miami] 1
4 116 Harper Maxwell [Munich, Miami] 2
My current code:
df= df.reset_index(drop=True)
#explode lists to rows
df_exploded = df.explode('cities')
# define id_counter and dictionary to person_id to id
id_counter = 1
id_matched = dict()
# define id function
def match_id(df):
global id_counter
# check if person_id already matched
if df['person_id'] not in id_matched.keys():
# get all persons with similar names (condition 1)
select = df_expanded[(df_expanded['first_name']==df['first_name']) & df_expanded['last_name']==df['last_name'])]
# get all persons with same last_name and city (condition 2)
if select.empty:
select_2 = df_expanded[(df_expanded['last_name']==df['last_name']) & (df_expanded['cities'] in df['cities'])]
# create new id for this specific person
if select_2.empty:
id_matched[df['person_id']] = id_counter
# create new id for group of person and record in dictionary
else:
select_list = select_2.unique().tolist()
select_list.append(df['person_id'])
for i in select_list:
id_matched[i] = id_counter
# create new id for group of person and record in dictionary
else:
select_list = select.unique().tolist()
select_list.append(df['person_id'])
for i in select_list:
id_matched[i] = id_counter
# set next id
id_counter += 1
# run function
df = df.progress_apply(match_id, axis=1)
# convert dict to DataFrame
df_id_matched = pd.DataFrame.from_dict(id_matched, orient='index', columns['id'])
.rename_axis('person_id').reset_index()
# merge back together with df to create df_id
Does anyone have a more efficient way to perform this task? The data set is huge and it would take several days...
Thanks in advance!