RapidFuzz is a library for fuzzy string matching in Python and C++.
Questions tagged [rapidfuzz]
16 questions
2
votes
1 answer
How to do effective matrix computation and not get memory overload for similarity scoring?
I have the following code for similarity scoring:
from rapidfuzz import process, fuzz
import pandas as pd
d_test = {
'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
'cluster_number' : [1, 2, 3, 3, 2, 1, 4,…

illuminato
- 1,057
- 1
- 11
- 33
2
votes
1 answer
How to set a column value by fuzzy string matching with another dataframe?
I have referred to this post but cannot get it to run for my particular case. I have two dataframes:
import pandas as pd
df1 = pd.DataFrame(
{
"ein": {0: 1001, 1: 1500, 2: 3000},
"ein_name": {0: "H for Humanity", 1: "Labor…

Umar Boodoo
- 69
- 6
2
votes
1 answer
Rapidfuzz match merge
Very new to this, would appreciate any advice on the following:
I have a dataset 'Projects' showing a list of institutions with project IDs:
project_id institution_name
0 somali national university
1 aarhus university
2 …

StrangeBadger
- 43
- 5
1
vote
1 answer
Apply Levenshtein distance from rapidfuzz.distance to dataframe with two columns
I have a csv file that looks as follows:
ID; name1; name2
1; John Doe; John Does
2; Mike Johnson; Mike Jonson
3; Leon Mill; Leon Miller
4; Jack Jo; Jack Joe
Now I want to calculate the Levenshtein distance for each pair of names. So compare "John…

PSt
- 97
- 11
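A minimal sketch of the per-row computation this question describes, assuming the file is semicolon-delimited with a space after each delimiter, as in the sample. The helper names `levenshtein` and `pair_distances` are hypothetical, and the distance is a plain-Python dynamic-programming version so the example runs without dependencies; with RapidFuzz installed, `rapidfuzz.distance.Levenshtein.distance(name1, name2)` computes the same unit-cost metric much faster.

```python
import csv
import io

# Sample rows copied from the question; note the space after each ';'.
CSV_TEXT = """ID; name1; name2
1; John Doe; John Does
2; Mike Johnson; Mike Jonson
3; Leon Mill; Leon Miller
4; Jack Jo; Jack Joe
"""


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ch_a
                current[j - 1] + 1,                # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


def pair_distances(text: str) -> dict[str, int]:
    """Map each row ID to the distance between its name1 and name2."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter=";", skipinitialspace=True
    )
    return {row["ID"]: levenshtein(row["name1"], row["name2"]) for row in reader}


distances = pair_distances(CSV_TEXT)
print(distances)
```

For a real file, the same reader settings should work with `open(path, newline="")` in place of `io.StringIO`; in pandas, the equivalent would be a row-wise apply over the two name columns.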
1
vote
1 answer
optimizing RapidFuzz for a list with large number of elements (e.g. 200,000)
I would like to run this piece of rapidfuzz code mentioned in this post on a list with 200,000 elements. I am wondering what's the best way to optimize this for a faster run on GPU?
Find fuzzy match string in a list with matching string value and…

nerd
- 473
- 5
- 15
1
vote
1 answer
Fuzzy Matching with different fuzz ratios
I have two large datasets. df1 is about 1m lines, and df2 is about 10m lines. I need to find matches for lines in df1 from df2.
I have posted an original version of this question separately. See here. Well answered by @laurent but I have some…

Umar Boodoo
- 69
- 6
1
vote
1 answer
Pandas fast fuzzy match
I have two data frames with the following format:
d = {'id2': ['1', '2'], 'name': ['paris city', 'london town']}
df1 = pd.DataFrame(data=d)
print(df1)
id2 name
0 1 paris city
1 2 london town
d = {'id2':…

Mustard Tiger
- 3,520
- 8
- 43
- 68
1
vote
2 answers
Is there a way to modify this code to reduce run time?
I am looking to modify this code to reduce the runtime of the fuzzywuzzy library. At present, it takes about an hour for a dataset with 800 rows, and when I used it on a dataset with 4.5K rows, it ran for almost 6 hours with still no result. I…

Shrumo
- 47
- 7
0
votes
0 answers
optimizing RapidFuzz for a large number of elements and obtaining match score
Following this answer I am also trying to obtain the string match score between two lists. What would be the best way of doing that?
elements = pd.DataFrame({'name':['vikash', 'vikas', 'Vinod', 'Vikky', 'Akash', 'Vinodh', 'Sachin', 'Salman', 'Ajay',…

RoyalPotatoe
- 13
- 2
0
votes
0 answers
How to do fuzzymatching on nested subsets of a dataframe?
I have a dataframe with columns: state, county, and agency_name, and I want to do fuzzy matching on the agency name to another dataframe that has more variables about agency names. But I want to fuzzy match only names within the same state and…

dave
- 31
- 1
- 2
0
votes
2 answers
How to make fuzzy search between lists showing matches and not found elements?
I'm trying to fuzzy match the values in the list to_search: search for each value of to_search within the choices list and show the corresponding item from the result list, like an MS Excel VLOOKUP but with fuzzy search.
This is my current code that…

Rasec Malkic
- 373
- 1
- 8
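The lookup described in this question can be sketched roughly as follows. The list contents and the helper `fuzzy_vlookup` are hypothetical (the question's own data is not shown here), and the standard library's `difflib.get_close_matches` stands in for the scorer so the example has no dependencies. With RapidFuzz, `process.extractOne(query, choices, score_cutoff=...)` plays the same role and returns `None` when nothing clears the cutoff, although its scores are on a 0 to 100 scale and are not identical to difflib's ratios.

```python
from difflib import get_close_matches

# Hypothetical data: `result` is aligned position by position with `choices`.
to_search = ["aple", "banana", "zzz"]
choices = ["apple", "banana", "cherry"]
result = ["A1", "B2", "C3"]


def fuzzy_vlookup(queries, choices, result, cutoff=0.6):
    """Return {query: (matched_choice, result_item)} or {query: None}."""
    lookup = dict(zip(choices, result))
    found = {}
    for query in queries:
        # n=1 keeps only the best candidate; an empty list means "not found".
        best = get_close_matches(query, choices, n=1, cutoff=cutoff)
        found[query] = (best[0], lookup[best[0]]) if best else None
    return found


matches = fuzzy_vlookup(to_search, choices, result)
for query, hit in matches.items():
    print(query, "->", hit if hit is not None else "not found")
```

Keeping unmatched queries in the output with a `None` value, rather than dropping them, is what makes the "not found" elements visible afterwards.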
0
votes
1 answer
Is there a way to speed up matching addresses and level of confidence per match between two data frames for large datasets?
I have a script below that checks the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match.
I am using RapidFuzz, as I heard it is faster than fuzzywuzzy.…

Kelly Tang
- 19
- 5
0
votes
1 answer
Using rapidfuzz on a dataframe
I have 4 columns, BusinessID, Name, BusinessID_y, and Name_y, and I want to match Name with Name_y at a 90% similarity score, dropping any rows that score below 90%. Sample input:
df
BusinessID NAME BusinessID_y NAME_y
1013120869 …

Sarthak Gupta
- 7
- 1
- 4
0
votes
1 answer
Why is the token set ratio so low using fuzzywuzzy?
I am using fuzzywuzzy and rapidfuzz to find names mentioned in comments. I read through the documentation of the "token_set_ratio" function but I still don't understand the following:
# I preprocessed the comments to remove stop words and commonly…

Michael Altorfer
- 21
- 5
0
votes
0 answers
Python: TypeError: can't pickle module objects multiprocessing on Jupyter Notebook
I am sorry that my code might look confusing. It reads in 300,000 items and tries to cross-reference them against another file (it looks for the best match for each item description in the other file).
I know that the library…

Student04
- 55
- 6