
My data contains the names of persons and a list of cities they lived in. I want to group them together following these conditions:

    1. first_name and last_name are identical,
    2. or (if 1. doesn't hold) their last_name is the same and they have lived in at least one identical city.

The result should be a new column indicating the group id that each person belongs to.
The DataFrame df looks like this:

>>> df
      person_id  last_name first_name                     cities
    0       112     Dorsey      Nancy         [Moscow, New York]
    1       113     Harper        Max  [Munich, Paris, Shanghai]
    2       114    Mueller        Max    [New York, Los Angeles]
    3       115     Dorsey      Nancy          [New York, Miami]
    4       116     Harper    Maxwell            [Munich, Miami]

The new dataframe df_id should look like this. The order of the ids is irrelevant (i.e., which group gets id=1), but only observations that fulfill either condition 1 or condition 2 should share an id.

>>> df_id
      person_id  last_name first_name                     cities  id
    0       112     Dorsey      Nancy         [Moscow, New York]   1
    1       113     Harper        Max  [Munich, Paris, Shanghai]   2
    2       114    Mueller        Max    [New York, Los Angeles]   3
    3       115     Dorsey      Nancy          [New York, Miami]   1
    4       116     Harper    Maxwell            [Munich, Miami]   2
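To make the matching rule explicit, a pairwise predicate for the two conditions could look like the sketch below (`same_group` and the row dicts are illustrative, not part of my code); the group ids are then the connected components of this relation, since matching is transitive across rows.

```python
def same_group(a, b):
    """Return True if rows a and b belong in the same group."""
    # condition 1: identical full name
    if a['first_name'] == b['first_name'] and a['last_name'] == b['last_name']:
        return True
    # condition 2: same last name and at least one shared city
    return (a['last_name'] == b['last_name']
            and bool(set(a['cities']) & set(b['cities'])))

dorsey_0 = {'first_name': 'Nancy', 'last_name': 'Dorsey', 'cities': ['Moscow', 'New York']}
dorsey_3 = {'first_name': 'Nancy', 'last_name': 'Dorsey', 'cities': ['New York', 'Miami']}
harper_1 = {'first_name': 'Max', 'last_name': 'Harper', 'cities': ['Munich', 'Paris', 'Shanghai']}
harper_4 = {'first_name': 'Maxwell', 'last_name': 'Harper', 'cities': ['Munich', 'Miami']}

print(same_group(dorsey_0, dorsey_3))  # True (condition 1)
print(same_group(harper_1, harper_4))  # True (condition 2, shared Munich)
print(same_group(dorsey_0, harper_1))  # False
```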

My current code:

df = df.reset_index(drop=True)
# explode city lists to one row per city
df_exploded = df.explode('cities')
# id counter and dictionary mapping person_id to id
id_counter = 1
id_matched = dict()

# define id function
def match_id(row):
  global id_counter

  # skip persons that are already matched
  if row['person_id'] not in id_matched:
    # get all persons with identical names (condition 1)
    select = df_exploded[(df_exploded['first_name'] == row['first_name']) & (df_exploded['last_name'] == row['last_name'])]
    if select.empty:
      # get all persons with the same last_name and a shared city (condition 2)
      select_2 = df_exploded[(df_exploded['last_name'] == row['last_name']) & (df_exploded['cities'].isin(row['cities']))]
      if select_2.empty:
        # no match: new id for this person only
        id_matched[row['person_id']] = id_counter
      else:
        # new id for the whole matched group, recorded in the dictionary
        for i in set(select_2['person_id']).union({row['person_id']}):
          id_matched[i] = id_counter
    else:
      # new id for the whole matched group, recorded in the dictionary
      for i in set(select['person_id']).union({row['person_id']}):
        id_matched[i] = id_counter
    # set next id
    id_counter += 1

# run function (df.progress_apply instead of df.apply works too, with tqdm's pandas integration)
df.apply(match_id, axis=1)

# convert dict to DataFrame
df_id_matched = (pd.DataFrame.from_dict(id_matched, orient='index', columns=['id'])
                             .rename_axis('person_id').reset_index())

# merge back together with df to create df_id
df_id = df.merge(df_id_matched, on='person_id')

Does anyone have a more efficient way to perform this task? The data set is huge and it would take several days...

Thanks in advance!

2 Answers

  • cond1: set an id for rows with the same first_name + last_name

import numpy as np

df['name'] = df['first_name'].fillna('') + '_' + df['last_name'].fillna('')
obj_name = df['name'].unique()
name_map = dict(zip(obj_name, np.arange(1, len(obj_name)+1)))
df['tag'] = df['name'].map(name_map)

  • cond2: find rows with the same last_name and a shared city

obj_last_name_city = df[['last_name', 'cities']].explode('cities')
cond = obj_last_name_city.duplicated(keep=False)
objn = obj_last_name_city[cond].reset_index().groupby(['last_name','cities'])['index'].agg(list)

# sort_values is important in case one last_name shares n cities (or vice versa),
# so tags are propagated starting from the smallest one
for i in objn.sort_values().tolist():
    df.loc[i[1:], 'tag'] = df.loc[i[0], 'tag']

# print(objn)

#     last_name  cities  
#     Dorsey     New York    [0, 3]
#     Harper     Munich      [1, 4]
#     Name: index, dtype: object

output:

   person_id last_name first_name                     cities  tag
0        112    Dorsey      Nancy         [Moscow, New York]    1
1        113    Harper        Max  [Munich, Paris, Shanghai]    2
2        114   Mueller        Max    [New York, Los Angeles]    3
3        115    Dorsey      Nancy          [New York, Miami]    1
4        116    Harper    Maxwell            [Munich, Miami]    2
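For reference, the steps above can be run end to end as a self-contained snippet (the sample frame is rebuilt from the question; the logic is exactly the answer's):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'person_id': [112, 113, 114, 115, 116],
    'last_name': ['Dorsey', 'Harper', 'Mueller', 'Dorsey', 'Harper'],
    'first_name': ['Nancy', 'Max', 'Max', 'Nancy', 'Maxwell'],
    'cities': [['Moscow', 'New York'], ['Munich', 'Paris', 'Shanghai'],
               ['New York', 'Los Angeles'], ['New York', 'Miami'],
               ['Munich', 'Miami']],
})

# cond1: identical first_name + last_name share a tag
df['name'] = df['first_name'].fillna('') + '_' + df['last_name'].fillna('')
obj_name = df['name'].unique()
name_map = dict(zip(obj_name, np.arange(1, len(obj_name) + 1)))
df['tag'] = df['name'].map(name_map)

# cond2: same last_name appearing with the same city
obj_last_name_city = df[['last_name', 'cities']].explode('cities')
cond = obj_last_name_city.duplicated(keep=False)
objn = (obj_last_name_city[cond].reset_index()
        .groupby(['last_name', 'cities'])['index'].agg(list))

# propagate the tag of the first row in each duplicate group to the rest
for i in objn.sort_values().tolist():
    df.loc[i[1:], 'tag'] = df.loc[i[0], 'tag']

print(df[['person_id', 'tag']])  # tags: [1, 2, 3, 1, 2]
```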
Ferris
  • Thanks, based on a first check on the large data set, the code works well! How come condition 2 doesn't overwrite condition 1? Could you please explain? – hunter-gatherers Jun 29 '21 at 08:23
  • cond2 will overwrite cond1, so at the end you should sort_values of `objn` to assign the values starting from the min tag. – Ferris Jun 29 '21 at 08:27
  • I think there will be a more efficient way to handle the loop `for i in objn.sort_values().tolist()`. – Ferris Jun 29 '21 at 08:29
import networkx 
from networkx.algorithms.components.connected import connected_components
def to_graph(l):
    G = networkx.Graph()
    for part in l:
        # each sublist is a bunch of nodes
        G.add_nodes_from(part)
        # it also implies a number of edges:
        G.add_edges_from(to_edges(part))
    return G

def to_edges(l):
    """ 
        treat `l` as a Graph and returns it's edges 
        to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
    """
    it = iter(l)
    last = next(it)

    for current in it:
        yield last, current
        last = current    
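A quick self-contained check of these helpers (re-declared here so the snippet runs on its own): overlapping index lists collapse into a single connected component, which is exactly how condition-1 and condition-2 groups get merged.

```python
import networkx
from networkx.algorithms.components.connected import connected_components

def to_edges(l):
    it = iter(l)
    last = next(it)
    for current in it:
        yield last, current
        last = current

def to_graph(l):
    G = networkx.Graph()
    for part in l:
        G.add_nodes_from(part)
        G.add_edges_from(to_edges(part))
    return G

# [1, 4] and [4, 5] overlap on node 4, so they merge into one component
L = [[0, 3], [1, 4], [4, 5], [2]]
components = sorted(sorted(c) for c in connected_components(to_graph(L)))
print(components)  # [[0, 3], [1, 4, 5], [2]]
```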
  • Start:
df = pd.DataFrame(
    [{'person_id': 112, 'last_name': 'Dorsey', 'first_name': 'Nancy', 'cities': ['Moscow', 'New York'], 'id': 1},
     {'person_id': 113, 'last_name': 'Harper', 'first_name': 'Max', 'cities': ['Munich', 'Paris', 'Shanghai'], 'id': 2},
     {'person_id': 114, 'last_name': 'Mueller', 'first_name': 'Max', 'cities': ['New York', 'Los Angeles'], 'id': 3}, 
     {'person_id': 115, 'last_name': 'Dorsey', 'first_name': 'Nancy', 'cities': ['New York', 'Miami'], 'id': 1},
     {'person_id': 116, 'last_name': 'Harper', 'first_name': 'Maxwell', 'cities': ['Munich', 'Miami'], 'id': 2}, 
     {'person_id': 117, 'last_name': 'Harper', 'first_name': 'Jan', 'cities': ['Miami'], 'id': 2}
    ]
)
df = df.reset_index()

# cond1
obj_name = df['first_name'].fillna('') + '_' + df['last_name'].fillna('')
obj1 = df.groupby(obj_name)['index'].agg(list).to_list()

# cond2
obj_last_name_city = df[['last_name', 'cities']].explode('cities')
cond = obj_last_name_city.duplicated(keep=False)
obj2 = obj_last_name_city[cond].reset_index().groupby(['last_name','cities'])['index'].agg(list).to_list()

L = obj1 + obj2
# [[5], [1], [2], [4], [0, 3], [0, 3], [4, 5], [1, 4]]

# use networkx
G = to_graph(L)
obj = pd.Series(connected_components(G))
obj = obj.map(list).sort_values()

idx_map = (obj.reset_index(drop=True)
           .explode()
           .reset_index()
           .set_index(0)['index']
           .to_dict())

df['_id'] = df['index'].map(idx_map) + 1
df
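The `idx_map` step above (turning a Series of component lists into a row-index → group-number dict) can be seen in isolation:

```python
import pandas as pd

# a Series of connected components, one list of row indices per group
obj = pd.Series([[0, 3], [1, 4, 5], [2]])

idx_map = (obj.reset_index(drop=True)
           .explode()            # one row per member, index = group number
           .reset_index()        # columns: 'index' (group) and 0 (member)
           .set_index(0)['index']
           .to_dict())
print(idx_map)  # {0: 0, 3: 0, 1: 1, 4: 1, 5: 1, 2: 2}
```

Mapping this over `df['index']` and adding 1 then yields the final group ids.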
Ferris