How to String.Contains() the Fuzzy way in C#?

Question

I have a list of persons that I want to search for while filtering. Each time the user enters a search string, the filtering is applied.

There are two challenges to consider:

The user may enter part of names
The user may mistyping

The first one is simply resolved by searching for substrings e.g. String.Contains(). The second one could be resolved by using a Fuzzy Implementation (e.g. https://fuzzystring.codeplex.com)

But I don't know how to master both challenges simultaneously.

For example: I want to find the person "Dr. Martin Fowler" when entering one of:

"Martin"
"Fawler"
"Marten Fauler"

I guess I need to write a "FuzzyContains()" logic, that handle my needs and also has an acceptable performance. Any advices how to start?

How about writing your own comparefunction that matches up chars in each position in the input strings and give you a % match for each name in your list? Then you can display a ordered list based on closest match — Logard, Sep 04 '14 at 07:44
Go through all the strings one by one, apply contains first, if match add it to ListA, then apply FuzzyString, if match add it to ListB. Then all the match string list is ListA.Union(ListB). — Haojie, Sep 04 '14 at 08:24

score 6 · Accepted Answer · edited Jun 10 '20 at 17:51

6

I modified Oliver answer who suggested the Levenshtein Distance algorithms, that is not the best choice here, since the calculated distance is to big when only parts of the names were entered. So, I ended up using the Longest Common Subsequence algorithm implemented by the awesome FuzzyString Lib.

const int TOLERANCE = 1;
string userInput = textInput.Text.ToLower();
var matchingPeople = people.Where(p =>
{
     //Check Contains
     bool contains = p.Name.ToLower().Contains(userInput);
     if(contains) return true;

     //Check LongestCommonSubsequence
     bool subsequenceTolerated = p.Name.LongestCommonSubsequence(userInput).Length >= userInput.Length - TOLERANCE;

     return subsequenceTolerated;
}).ToList();

edited Jun 10 '20 at 17:51

Shashank Shekhar

3,958
2
40
52

answered Sep 05 '14 at 11:11

mamuesstack

1,111
2
16
34

1

Do not convert a string to lower just to check if it is contained. That is really bad for performance because you convert the whole string instead of just comparing it. Because Contains doesn't have an overload with StringComparision, use IndexOf. – Snicker Aug 16 '17 at 14:07
Contains has a StringComparison overload in dotnet core. – user888734 Oct 22 '19 at 08:41

score 4 · Answer 2 · edited Aug 25 '22 at 11:16

4

Seems to be a job for the Levenshtein distance algorithm (one of the dozens C# implementations).

You give this algorithm two strings (the one the user entered and one out of your list). Then it calculates how much characters must be replaced, added or removed to come from the first string to the second one. Then you can take all elements from your list where the distance is smaller or equal three (for example) to find simple typos.

If you have this method you could maybe use it that way:

var userInput = textInput.Text.ToLower();
var matchingEmployees = EmployeeList.Where(x =>
    x.Name.ToLower().Contains(userInput) ||
    LevenshteinDistance.Compute(x.Name.ToLower(), userInput) <= 3).ToList();

edited Aug 25 '22 at 11:16

Electric Coffee

11,733
9
70
131

answered Sep 04 '14 at 08:17

Oliver

43,366
8
94
151

Thanks for this idea. The Problem is when searching for "Dr. Martin Fowler" by typing "Marten" it's not containing and the Levenshtein Distance is 16! – mamuesstack Sep 05 '14 at 05:52
2

You could split each name at a whitespace (e.g. `x.Name.ToLower().Split(' ')`) and test all elements with the levenshtein. But i think if you have later a real (and probably big) list you'll get a lot of *valid* entries which absolutely don't match. – Oliver Sep 05 '14 at 10:01
To get this really to work, you could implement some kind of learning algorithm, by saving the keywords the user tried and what he finally picked and save this stuff somewhere. Then you could test the user input not only against the real list, but also against this mapping list. But i think it is quite hard to do it really right if these already mentioned mechanisms here doesn't match your needs. – Oliver Sep 05 '14 at 10:04
This solution works, but it could be done more efficiently using [a modified version](https://stackoverflow.com/a/37804497/975097) of the Levenshtein distance algorithm. – Anderson Green Apr 24 '22 at 00:38

Gary Walker · Answer 3 · 2014-09-04T08:04:53.680

I've done this myself before and started with the some of the methods listed on wikipedia approximate string matching. When I got done, I tuned my algorithm in ways that were not as general purpose, but gave me better matches in my domain.

If your whole dictionary is in memory and not too large, you can simply apply you matching algorithm against every member in the dictionary. If your dictionary is large, this will likely overuse your resources and you will need a better algorithm. You might want to consider using full text search feature of your database too.

In my case, I iterated though each string in my dictionary comparing "matching runs", i.e., 2 points for having a 2 character match, 3 for a 3 character match up to an 8 character match. I ran though all possible of the pairs, triples, etc. -- scoring each dictionary entry and selecting the highest scoring match. Tolerates typos, word order, etc. but computationally expensive -- my dictionary was a most a few thousand phrases so this worked very well for me. This is a modified version of Dice's coefficient.

score 0 · Answer 4 · answered Sep 04 '14 at 07:47

0

Did you try brute force? The simplest way would be to match the search string with substrings of target strings starting from the beginning and then take closest match from all of the matches.

But this might not be acceptable from performance view.

answered Sep 04 '14 at 07:47

Euphoric

12,645
1
30
44

score 0 · Answer 5 · answered Sep 04 '14 at 07:51

0

Maybe you could use this soundex implementation: CodeProject What soundex does is comparing two strings and calculating the "pronunciation-similarity" in percentage. A time ago i build a search with the help of this function(PHP hasit built-in)

answered Sep 04 '14 at 07:51

N1C0

121
4

Thanks for sharing. The soundex implementation only assist the user when mistyping. It doesn't work when entering only a part of the name. – mamuesstack Sep 04 '14 at 07:59
There i would use `string.contains()` – N1C0 Sep 04 '14 at 08:05

score 0 · Answer 6 · edited Nov 27 '14 at 21:26

I had a project for school some time ago, where we had a textbox in which students could search for every employee, student that has something to do with the school. We were talking about a couple of hundred people. Simple Linq query that we used was blazingly fast on a Core i3 processor. The query was called every time a user typed something in the textbox. In a TextChanged event we called a query that looked like this:

var resultData = EmployeeList.Where(x=>x.Name.ToLower().Contains(textInput.Text.ToLower())).ToList();

Of course, this logic applies only if you have "Dr. Martin Fowler" in one property or a member.

Yes, that works when entering a part of the name to find. But it doesn't work when mistyping — mamuesstack, Sep 04 '14 at 08:03

score 0 · Answer 7 · answered Sep 04 '14 at 09:32

In other languages like python we have cool stuffs for text processing including distance computations. There are some algorithms such as Levenshtein that computes the fuzzy distance between two strings. I saw some implementations in C# (in here) and also another module was difflib which is available in here. the outputs of these algorithms is a number. the closer to 0 the better.

How to String.Contains() the Fuzzy way in C#?

7 Answers7

Linked

Related