3

Let's say I have the following 3 dataframes:

import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import io
import csv
import itertools
import xlsxwriter

df1 = pd.DataFrame(np.array([
    [1010667747, 'Suzhou', 'Suzhou IFS'],
    [1010667356, 'Shenzhen', 'Kingkey 100'],
    [1010667289, 'Wuhan', 'Wuhan Center']]),
    columns=['id', 'city', 'name']
)    
df2 = pd.DataFrame(np.array([
    [190010, 'Shenzhen', 'Ping An Finance Centre'],
    [190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
    [190015, 'Beijing', 'China Zun']]),
    columns=['id', 'city', 'name']
)    
df3 = pd.DataFrame(np.array([
    ['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
    ['ZY-15', 'Hong Kong', 'International Commerce Centre'],
    ['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
    columns=['id', 'city', 'name']
)

I would like to find similar building names by calculating their similarity ratio using fuzzywuzzy package, here is my solution which need to improve:

First, I concatenate all three dataframes to one column as full_name. At this step, in fact, I shouldn't add id to full_name but in order to better distingue building names from different dataframes, I have added it:

df1['full_name'] = df1['id'].apply(str) + '_' + df1['city'] + '_' + df1['name']
df2['full_name'] = df2['id'].apply(str) + '_' + df2['city'] + '_' + df2['name']
df3['full_name'] = df3['id'].apply(str) + '_' + df3['city'] + '_' + df3['name']

df4 = df1['full_name'] 
df5 = df2['full_name'] 
df6 = df3['full_name'] 

frames = [df4, df5, df6]
df = pd.concat(frames)

df.columns = ["full_name"]
df.to_excel('concated_names.xlsx', index = False)

Second, I iterate all full_names and compare with each others to get similarity_ratio of each pairs of building names:

df = pd.read_excel('concated_names.xlsx')
projects = df.full_name.tolist()

processedProjects = []
matchers = []

threshold_ratio = 10

for project in projects:
    if project:
        processedProject = fuzz._process_and_sort(project, True, True)
        processedProjects.append(processedProject)
        matchers.append(fuzz.SequenceMatcher(None, processedProject))

with open('output10.csv', 'w', encoding = 'utf_8_sig') as f1:
    writer = csv.writer(f1, delimiter=',', lineterminator='\n', )
    writer.writerow(('name', 'matched_name', 'similarity_ratio'))

    for project1, project2 in itertools.combinations(enumerate(processedProjects), 2):
        matcher = matchers[project1[0]]
        matcher.set_seq2(project2[1])
        ratio = int(round(100 * matcher.ratio()))
        if ratio >= threshold_ratio:
            #print(projects[project1[0]], projects[project2[0]])
            my_list = projects[project1[0]], projects[project2[0]], ratio
            print(my_list)
            writer.writerow(my_list)

my_list result:

('1010667747_Suzhou_Suzhou IFS', '1010667356_Shenzhen_Kingkey 100', 44)
('1010667747_Suzhou_Suzhou IFS', '1010667289_Wuhan_Wuhan Center', 49)
('1010667747_Suzhou_Suzhou IFS', '190010_Shenzhen_Ping An Finance Centre', 33)
('1010667747_Suzhou_Suzhou IFS', '190012_Guangzhou_Guangzhou CTF Finance Centre', 47)
......

At last step, I split output10.csv manually in Excel and get my final expected result like this (it will be better if I have dataframe source for each building):

           id    city        name  matched_id matched_name  \
0  1010667747  Suzhou  Suzhou IFS  1010667356     Shenzhen   
1  1010667747  Suzhou  Suzhou IFS  1010667289        Wuhan   
2  1010667747  Suzhou  Suzhou IFS      190010     Shenzhen   
3  1010667747  Suzhou  Suzhou IFS      190012    Guangzhou   
4  1010667747  Suzhou  Suzhou IFS      190015      Beijing   

                 matched_name.1  similarity_ratio  
0                   Kingkey 100                44  
1                  Wuhan Center                49  
2        Ping An Finance Centre                33  
3  Guangzhou CTF Finance Centre                47  
4                     China Zun                27  

How could I get final expected result in a way more efficient in Python? Thanks.

ah bon
  • 9,293
  • 12
  • 65
  • 148

1 Answers1

1

try this solution: i am using numpy and itertools to speed up and simplify the coding and no need to use excel file...

import numpy as np
from fuzzywuzzy import fuzz
from itertools import product
import pandas as pd

   :
   :

frames = [pd.DataFrame(df4), pd.DataFrame(df5), pd.DataFrame(df6)]
df = pd.concat(frames).reset_index(drop=True)

dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
df1 = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]), columns=df.full_name.values.tolist())

#create of list of dataframes (each row id dataframe)
listOfDfs = [df1.loc[idx] for idx in np.split(df1.index, df.shape[0])]

#in dictionary, you have a Dataframe by name wich contains all ratios from other names
DataFrameDict = {df['full_name'][i]: listOfDfs[i] for i in range(df1.shape[0])}

for name in DataFrameDict.keys():
    print(name)
    #print(DataFrameDict[name]
Frenchy
  • 16,386
  • 3
  • 16
  • 39
  • Thanks. Please write to excel if you like: `df = pd.DataFrame(list(DataFrameDict.items())) df.to_excel('test.xlsx', index = False)` – ah bon Mar 26 '19 at 03:01