I am trying to match input (coming from third-party software) like this:

PIPPO CANASTA PER FT 501 del 1/11/2016

against a list of people that can be modeled as an array of strings (coming from another piece of software):

[
  ...
  "CANASTA              PIPPO"
  ...
]

How can I accomplish this using C# (.NET)?

Uwe Keim
Genesio
  • This doesn't seem to be a fuzzy search. If you'd accept "Cannasta" it would be. You can search for "Levenshtein distance" if you're interested in that. – Carra Dec 20 '16 at 10:18
  • [.NET library for text algorithms?](http://stackoverflow.com/questions/4508307/net-library-for-text-algorithms) – Y.B. Dec 20 '16 at 10:26
  • Stefan Szakal's Blog [Text similarity algorithms via C#](http://www.stefanszakal.co.uk/text-similarity-algorithms-via-c/) looks interesting too. – Y.B. Dec 20 '16 at 10:37
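
For reference, the Levenshtein distance mentioned in the comments above can be sketched in a few lines. This is only a minimal illustration of the standard dynamic-programming algorithm, not code from any of the linked libraries; the class name `EditDistance` is made up for the example, and the comparison is case-sensitive, so inputs would need to be normalized (for example upper-cased) first.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class EditDistance
{
    // Minimum number of single-character insertions, deletions or
    // substitutions needed to turn a into b (case-sensitive).
    public static int Levenshtein(string a, string b)
    {
        // previous[j] = distance between the first i-1 chars of a and the first j chars of b.
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Turning an empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            // Turning the first i chars of a into an empty string takes i deletions.
            current[0] = i;

            for (int j = 1; j <= b.Length; j++)
            {
                int substitutionCost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,        // insertion
                             previous[j] + 1),          // deletion
                    previous[j - 1] + substitutionCost  // substitution or match
                );
            }

            // The row just computed becomes the "previous" row for the next character.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }
}
```

With this, `EditDistance.Levenshtein("CANASTA", "CANNASTA")` is 1, so a misspelling like the one in the first comment could be accepted by allowing a small distance per word.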

1 Answer

You can split each string into an array of words and search the list for the entry that shares the most words with the input:

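using System;
using System.Linq;

string[] arrayToSearch = new string[] {
    "OTHER STUFF",
    "CANASTA              PIPPO",
    "MORE STUFF"
};

string stringToFind = "PIPPO CANASTA PER  FT 501 del 1/11/2016";

// A null separator array makes Split break on any whitespace.
string[] wordsToFind = stringToFind.Split(default(char[]), StringSplitOptions.RemoveEmptyEntries);

// Rank each entry by how many of its words also occur in the input (case-insensitive).
// Note: if no entry shares a word with the input, the first entry is still returned.
string bestMatch = arrayToSearch.OrderByDescending(
    s => s.Split(default(char[]), StringSplitOptions.RemoveEmptyEntries)
          .Intersect(wordsToFind, StringComparer.OrdinalIgnoreCase)
          .Count()
).FirstOrDefault();

Console.WriteLine("Best match: " + bestMatch);
Console.ReadKey();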
Y.B.
  • The suggestion is nice and simple, I must admit. I am going to add some kind of logic (such as word weights) because I have some weird cases where prepositions and/or abbreviations return misleading results – Genesio Dec 20 '16 at 10:06
  • One simple way to improve the above search results would be to take the length of the matching words into account, but depending on the size of the list being searched and how often it updates, you might as well consider a ready-made solution. – Y.B. Dec 20 '16 at 10:13
  • Thank you for adding the snippet. That's more or less what I ended up with. – Genesio Dec 20 '16 at 10:13
  • I could evaluate some libraries, but I have no idea where to start. Googling didn't lead me to the desired results. – Genesio Dec 20 '16 at 10:14
  • I use SQL Server [Full-text Index](https://msdn.microsoft.com/en-us/library/ms187317.aspx) in my work, and it yields pretty good results. I have also seen some pure .NET implementations of the above, with a few layers of cleaning and filtering, that work rather well too. – Y.B. Dec 20 '16 at 10:21
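
The weighting ideas from the comments above (ignoring prepositions and abbreviations, and favoring longer matching words) could look roughly like the sketch below. It is only one possible reading of those comments, not the code Genesio ended up with: the class name `WeightedNameMatcher`, the contents of the ignored-word list, and the choice of "sum of matching word lengths" as the score are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class WeightedNameMatcher
{
    // Assumed list of words that should never count as a match
    // (prepositions, abbreviations). Adjust to the real data.
    private static readonly HashSet<string> IgnoredWords =
        new HashSet<string>(new[] { "PER", "DEL", "FT" }, StringComparer.OrdinalIgnoreCase);

    // Splits on any whitespace and drops the ignored words.
    private static IEnumerable<string> Words(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                   .Where(w => !IgnoredWords.Contains(w));
    }

    // Score = total length of the distinct words shared by both strings,
    // so a long surname weighs more than a short token.
    public static int Score(string candidate, string input)
    {
        return Words(candidate)
            .Intersect(Words(input), StringComparer.OrdinalIgnoreCase)
            .Sum(w => w.Length);
    }

    // Returns the highest-scoring candidate, or null when no candidate
    // shares any non-ignored word with the input.
    public static string BestMatch(IEnumerable<string> candidates, string input)
    {
        return candidates
            .Select(c => new { Candidate = c, Points = Score(c, input) })
            .Where(x => x.Points > 0)
            .OrderByDescending(x => x.Points)
            .Select(x => x.Candidate)
            .FirstOrDefault();
    }
}
```

Unlike the plain word count, this returns `null` instead of an arbitrary first entry when nothing matches, which makes a failed lookup easier to detect by the caller.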