2

I am trying to compare two texts (predefined text and user input text).

Every misspelled char must be rated as a mistake.

Example:

Answer is "Handtuchhalter montieren"


"Handtochhalter montieren" = 1 mistake

"Handtuchhalter Montieren" = 1 mistake

"Handtuchhaltermontieren" = 1 mistake

"Hundtochhalter montieren" = 2 mistakes

"Handtochhalter  montieren" = 2 mistakes (2 spaces = 1 mistake)

"Handtochhalter  muntieren" = 3 mistakes (2 spaces)

"Hundtochhalter   muntieren" = 5 mistakes (3 spaces)

The problem is, that you cannot check every char and you cannot check every word. At least the way I did it you could not.

I am programming in C# but I would be happy about algorithms from other languages too.

Isamu
  • 41
  • 1
  • 9
  • 2
    You can use `string.Compare(orginal_data, input_data, true)` to get number of mistakes or unmatched data, if negative then take absolute of that. – Arindam Nayak Oct 12 '14 at 15:36
  • 2
    You might be interested in http://en.wikipedia.org/wiki/Levenshtein_distance – Mulan Oct 12 '14 at 15:38
  • This is a seriuosly non-trivial task. You may find [this interesting](http://stackoverflow.com/questions/24887238/how-to-compare-two-rich-text-box-contents-and-highlight-the-characters-that-are/24970638#24970638) : It is the 'big solution', using a regular diff function.. – TaW Oct 12 '14 at 15:45
  • definitely levenshtein distance. There's a ton of code snippets out there implementing it – Jonesopolis Oct 12 '14 at 16:28
  • @ArindamNayak Thanks for the answer, but this method gives me some sort of random numbers. Example: Answer is "Internet" and input was "Iternne t" the number it gave me was 1... The right number would be 3 – Isamu Oct 12 '14 at 16:33
  • @ArindamNayak: According to [Microsoft](http://msdn.microsoft.com/en-us/library/zkcaxw5y(v=vs.110).aspx) "The comparison terminates when an inequality is discovered [...]". Therefore this method cannot count the number of differencies. – Olivier Jacot-Descombes Oct 12 '14 at 16:34
  • Yes, i agree, got influenced by the inputs shown in question! – Arindam Nayak Oct 12 '14 at 16:38
  • 1
    @Isamu: I'm getting the number `3` with the recursive example shown in en.wikipedia.org/wiki/Levenshtein_distance with `"Internet"` and `"Iternne t"`. – Olivier Jacot-Descombes Oct 12 '14 at 16:53
  • 1
    The iterative version with two matrix rows yields the same result. The code given there is almost valid c# code. Replace `Minimum(a,b,c)` with `Math.Min(Math.Min(a, b), c)`. – Olivier Jacot-Descombes Oct 12 '14 at 17:02
  • @Isamu: Does a letter case difference count as a mistake? (e.g. Montieren vs montieren - 1 mistake?) – Arca Artem Oct 12 '14 at 21:57
  • @OlivierJacot-Descombes First of all, thanks for your answers. I am going to use this version `"iterative version with two matrix rows"` as a temporary solution. @ArcaArtem Yes, your example has to be counted as a mistake. – Isamu Oct 12 '14 at 23:02
  • @ArcaArtem: It's up to you to define whether comparisions are case sensitive or not. Or maybe you got a specification for this aspect of the task. – Olivier Jacot-Descombes Oct 13 '14 at 11:46
  • @OlivierJacot-Descombes Yes, my question was in this context. i.e. what the expected behaviour was. I may have a different solution involving suffix trees – Arca Artem Oct 13 '14 at 15:31
  • @OlivierJacot-Descombes Is there a way to get the index of the char that forces the mistake (using `interative version with two matrix rows` ) ? – Isamu Oct 13 '14 at 20:40
  • @Isamu: Since there can be several mistakes, you would have to return a list of mistakes and since there are two indexes (in `s` and in `t`) you would have either take one of them or return both. The line `var cost = (s[i] == t[j]) ? 0 : 1;` yields a `cost` of `1` when there is a mistake. This is a good place for taking the index (`i` or `j`). Also you would have to consider the special cases where one of the lengths is `0` (see "degenerate cases"). – Olivier Jacot-Descombes Oct 14 '14 at 16:03
  • @OlivierJacot-Descombes So `i` would be the index of the the mistake in string `s` and `j` would be the index of the mistake in string `t` ? – Isamu Oct 14 '14 at 19:59
  • Its the position of the characters compared. Try to add these postions into a list whenever `cost == 1` and see what happens. I didn't make the experiment. – Olivier Jacot-Descombes Oct 14 '14 at 21:02
  • Looks like this has been asked before: http://stackoverflow.com/questions/2344320/comparing-strings-with-tolerance – Owen Davies Oct 23 '14 at 10:31

0 Answers0