C# - Fuzzy compare of two large string arrays

Question

I need to find all strings in B that "partly" exists in A.

B = [ "Hello World!", "Hello Stack Overflow!", "Foo Bar!", "Food is nice...", "Hej" ]
A = [ "World", "Foo" ]
C = B.FuzzyCompare(A) // C = [ "Hello World!", "Foo Bar!", "Food is nice..." ]

I've been looking into using Levenshtein Distance Algorithm for the "fuzzy" part of the problem, as well as LINQ for the iterations. However, A * B usually results in over 1,5 billion comparisons.

How should i go about this? Is there a way to quickly "almost compare" two Lists of strings?

You examples don't seem to lend themselves to Levenshtein Distance. They look more like checking for substrings. — juharr, Jul 12 '16 at 15:00
@juharr yep, this is an ideal problem for a Suffix Trie, per my answer below. — Haney, Jul 12 '16 at 15:13
How many elements are usually in `B` and `A`? And how large are the elements, on average, in `B`, e.g. "Hello World!" == 12? Also, are the elements in `A` just single words? — Jarrod Dixon, Jul 12 '16 at 17:49
@JarrodDixon on average the elements in A and B are between 100 and 200 characters. — Olian04, Jul 13 '16 at 08:58

score 5 · Accepted Answer · answered Jul 12 '16 at 15:01

5

Maybe it's sufficient to simply compare substrings, this would be much more efficient:

var C = B.Where(s1 => A.Any(s2 => s1.IndexOf(s2, StringComparison.OrdinalIgnoreCase) >= 0)).ToList();

answered Jul 12 '16 at 15:01

Tim Schmelter

450,073
74
686
939

Providing he wants to ignore case? – Liam Jul 12 '16 at 15:02
1

@Liam: i think case doesn't matter if he even wants to find similar strings. If it does he can simply change it to a different `StringComparison`. If i would have shown `String.Contains` he would not have known how to compare case insensitively. – Tim Schmelter Jul 12 '16 at 15:04
@Rango can you have a look at similar kind of problem I have but I am actually passing threshold value for similarity https://stackoverflow.com/questions/54511595/optimize-matching-elements-from-two-large-data-sets-using-levenshtein-distance/54516063?noredirect=1#comment95866563_54516063 – Shahid Manzoor Bhat Feb 07 '19 at 08:42
@Rango Can we somehow specify a percentage of similarity in this query? Like only select the words from List B which are 70% or more similar to the word in List A? Thanks – Shahid Manzoor Bhat Feb 11 '19 at 07:07
@Rango Thank you very much, I tried using the Levenshtein distance approach. But the problem I face is of quadratic scaling, This is what I did https://stackoverflow.com/questions/54511595/optimize-matching-elements-from-two-large-data-sets-using-levenshtein-distance Can you suggest how I can optimize this? – Shahid Manzoor Bhat Feb 11 '19 at 08:03

Haney · Answer 2 · 2016-07-12T15:13:58.797

4

This seems like a good use of a Suffix Trie.

A Suffix Trie is a tree with no payload. It indexes all suffixes of a given string or sentence so that they can be searched in O(n) time. So, if your input in A was "hello", it would index "hello", "ello", "llo", "lo", and "o" in a way that would allow any of those substrings to immediately and efficiently be looked up without any additional enumeration of the set of A.

Basically, take all the values in A and process them into a Suffix Trie which is an O(n * m) operation done once where n is the number of elements in A and m is the length of the elements. Then for each element of B check for it in the Suffix Trie which is also an O(n * m) operation where n is the number of elements in B and m is the length of the elements.

edited Jul 12 '16 at 15:13

answered Jul 12 '16 at 15:00

Haney

32,775
8
59
68

I'd love to upvote this as it sounds ace but I just don't understand it... :) – Liam Jul 12 '16 at 15:03
2

@Liam basically the original question specifically asks to avoid the A * B cost of 1.5 billion comparisons, which this answer does, but the others I see here do not. – Haney Jul 12 '16 at 15:17
You might be able to help me out here, I wanted to update my answer in relation to yours but I wans't sure: http://stackoverflow.com/questions/38332900/what-woukd-the-big-o-be-of-a-for-loop-with-an-any-inside-it – Liam Jul 12 '16 at 15:18
@Haney can you help me out here https://codereview.stackexchange.com/questions/212823/the-process-to-calculate-the-levenshtien-distance-of-each-element-of-a-large-dat – Shahid Manzoor Bhat Feb 07 '19 at 09:01

score 3 · Answer 3 · edited May 23 '17 at 12:14

3

I think you may be other thinking it:

List<string> results = new List<string>();
foreach (string test in B)
{
   if (A.Any(a => test.Contains(a))
      results.Add(test);
}

BTW the complexity of this is somewhere in the region of O(n)(best) and O(n*m)(worst) (where n is the numer of results in A and m is the number of results in B)

edited May 23 '17 at 12:14

Community

1
1

answered Jul 12 '16 at 15:00

Liam

27,717
28
128
190

C# - Fuzzy compare of two large string arrays

3 Answers3

Linked