I have a file with x number of string names and their associated IDs. Essentially two columns of data.
What I would like, is a correlation style table with the format x by x (having the data in question both as the x-axis and y axis), but instead of correlation, I would like the fuzzywuzzy library's function fuzz.ratio(x,y) as the output using the string names as input. Essentially running every entry against every entry.
This is sort of what I had in mind. Just to show my intent:
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.read_csv('random_data_file.csv')
df = df[['ID','String']]
df['String_Dup'] = df['String'] #creating duplicate of data in question
df = df.set_index('ID')
df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())
But clearly this approach is not working for me at the moment. Any help appreciated. It doesn't have to be pandas, it is just an environment I am relatively more familiar with.
I hope my issue is clearly worded, and really, any input is appreciated,