Questions tagged [rapidfuzz]

RapidFuzz is a library for fuzzy string matching in Python and C++.

16 questions
2 votes, 1 answer

How to do effective matrix computation and not get memory overload for similarity scoring?

I have the following code for similarity scoring: from rapidfuzz import process, fuzz import pandas as pd d_test = { 'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'], 'cluster_number' : [1, 2, 3, 3, 2, 1, 4,…
illuminato • 1,057 • 1 • 11 • 33
2 votes, 1 answer

How to set a column value by fuzzy string matching with another dataframe?

I have referred to this post but cannot get it to run for my particular case. I have two dataframes: import pandas as pd df1 = pd.DataFrame( { "ein": {0: 1001, 1: 1500, 2: 3000}, "ein_name": {0: "H for Humanity", 1: "Labor…
2 votes, 1 answer

Rapidfuzz match merge

Very new to this, would appreciate any advice on the following: I have a dataset 'Projects' showing a list of institutions with project IDs: project_id institution_name 0 somali national university 1 aarhus university 2 …
1 vote, 1 answer

Apply Levenshtein distance from rapidfuzz.distance to dataframe with two columns

I have a CSV file that looks as follows: ID; name1; name2 1; John Doe; John Does 2; Mike Johnson; Mike Jonson 3; Leon Mill; Leon Miller 4; Jack Jo; Jack Joe Now I want to calculate the Levenshtein distance for each pair of names. So compare "John…
PSt • 97 • 11
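
The row-wise comparison described in the excerpt above can be sketched as follows. This is only an illustration under stated assumptions, not the accepted answer: it assumes the `;`-delimited layout shown in the excerpt, uses `rapidfuzz.distance.Levenshtein.distance` when RapidFuzz is installed, and otherwise falls back to a plain dynamic-programming edit distance written here for illustration (the `levenshtein` name, the `raw` sample string, and the `distances` dict are hypothetical).

```python
import csv
import io

try:
    # Assumption: RapidFuzz is installed; use its Levenshtein distance.
    from rapidfuzz.distance import Levenshtein
    levenshtein = Levenshtein.distance
except ImportError:
    def levenshtein(a, b):
        """Fallback edit distance: insert, delete, substitute each cost 1."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

# Sample rows copied from the question excerpt (';' delimiter, space after it).
raw = """ID; name1; name2
1; John Doe; John Does
2; Mike Johnson; Mike Jonson
3; Leon Mill; Leon Miller
4; Jack Jo; Jack Joe
"""

# skipinitialspace drops the blank that follows each ';'.
reader = csv.DictReader(io.StringIO(raw), delimiter=";", skipinitialspace=True)
distances = {row["ID"]: levenshtein(row["name1"], row["name2"]) for row in reader}

for row_id, dist in distances.items():
    print(row_id, dist)
```

With a pandas DataFrame, the same function could be applied per row instead of going through `csv`; that variant is left out here because the excerpt does not show how the file is loaded.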
1 vote, 1 answer

optimizing RapidFuzz for a list with large number of elements (e.g. 200,000)

I would like to run this piece of rapidfuzz code mentioned in this post on a list with 200,000 elements. I am wondering what's the best way to optimize this for a faster run on GPU? Find fuzzy match string in a list with matching string value and…
nerd • 473 • 5 • 15
1 vote, 1 answer

Fuzzy Matching with different fuzz ratios

I have two large datasets. df1 is about 1m lines, and df2 is about 10m lines. I need to find matches for lines in df1 from df2. I have posted an original version of this question separately. See here. Well answered by @laurent but I have some…
1 vote, 1 answer

Pandas fast fuzzy match

I have two data frames with the following format: d = {'id2': ['1', '2'], 'name': ['paris city', 'london town']} df1 = pd.DataFrame(data=d) print(df1) id2 name 0 1 paris city 1 1 london town d = {'id2':…
Mustard Tiger • 3,520 • 8 • 43 • 68
1 vote, 2 answers

Is there a way to modify this code to reduce run time?

I am looking to modify this code to reduce the runtime of the fuzzywuzzy library. At present, it takes about an hour for a dataset with 800 rows, and when I used it on a dataset with 4.5K rows, it kept running for almost 6 hours with still no result. I…
0 votes, 0 answers

optimizing RapidFuzz for a large number of elements and obtaining match score

Following this answer I am also trying to obtain the string match score between two lists. What would be the best way of doing that? elements = pd.DataFrame({'name':['vikash', 'vikas', 'Vinod', 'Vikky', 'Akash', 'Vinodh', 'Sachin', 'Salman', 'Ajay',…
0 votes, 0 answers

How to do fuzzymatching on nested subsets of a dataframe?

I have a dataframe with columns state, county, and agency_name, and I want to fuzzy match the agency name against another dataframe that has more variables about agency names. But I want to fuzzy match only names within the same state and…
dave • 31 • 1 • 2
0 votes, 2 answers

How to make fuzzy search between lists showing matches and not found elements?

I'm trying to fuzzy match the values in the list to_search: search for each value of to_search within the choices list and show the corresponding item from the result list. Like an MS Excel VLOOKUP, but with fuzzy search. This is my current code that…
Rasec Malkic • 373 • 1 • 8
0 votes, 1 answer

Is there a way to speed up matching addresses and level of confidence per match between two data frames for large datasets?

I have a script below that checks the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match. I am using rapidfuzz because I heard it is faster than fuzzywuzzy.…
Kelly Tang • 19 • 5
0 votes, 1 answer

Using rapidfuzz on a dataframe

I have 4 columns (BuisnessID, Name, BuisnessID_y, Name_y) and I want to match Name with Name_y at a 90% similarity score, dropping the rows that do not reach 90%. Sample input df BusinessID NAME BusinessID_y NAME_y 1013120869 …
0 votes, 1 answer

Why is the token set ratio so low using fuzzywuzzy?

I am using fuzzywuzzy and rapidfuzz to find names mentioned in comments. I read through the documentation of the "token_set_ratio" function but I still don't understand the following: # I preprocessed the comments to remove stop words and commonly…
0 votes, 0 answers

Python: TypeError: can't pickle module objects multiprocessing on Jupyter Notebook

I am sorry that my code might look confusing, but what it does is read in 300,000 items and try to cross-reference them against another file. (It tries to find the best match for the item descriptions in another file.) I know that the library…