-1

I will start by saing that i have NO INFLUENCE on the input and suggestions to correct it cant help me. I am asking how to fix the output.

I have descriptions in German. The problem is that some of them were corrupted in the process. Words that have one of 7 German special letters, can have corrupted chars like: ('%�%')
('%¿%')
('%Ø%')
('%¶%')
('%Â%')
('%Ã%')
('%©%')

The difficulty is also because one letter can be "translated" to one corrupted char or even 3 corrupted chars. So the word "für" can be corrupted to "fÂr" or to "f??r" or to "f�r" and i dont have any specific pattern that i can use in regex.

I need to build some algorithm that:

  1. Finds a corruption in a given description.
  2. Finds the correction for the corrupted word.

What do i have?

  1. The description
  2. A German dictionary with all the words that have special chars.

I want to implement it in PHP\Queries but its not mandatory. Any ideas how to do it?

Alex
  • 169
  • 1
  • 1
  • 9

1 Answers1

1

A general algorithm (you'll have to implement it in your programming language) goes like this:

First, let's write our helper function: 1) Given a word, look for each corrupted char in the word.

2) starting with the first, make a switch between a corrupted char and a special german char.

3) see if there are any words (look in the "dictionary") that start with the sub-string of up to the char you just switched. If none, go back to 2 and make a different switch. If there are, keep switching the next cirrupted chars.

4) when you can't switch any more corrupted chars, check if this is a word. If it is, add it to the set if possible words. Else, go back and make a different switch.

Then, let's go to the main algorithm:

1) search for a corrupted char (one of those you stated), this can be done by simply checking all the chars one by one.

2) When you find a corrupted char - send the entire word the char belongs to to the helper function.

3) choose among the options suggested by the helper function, or just let the helper function choose by itself.

4) make the switch, move to the end if the string.

5) return to 1

Sorry for any typos, hope it helpes!

  • Your algo. would work great if the number of corrupted chars would be the same as the number of the correct chars. How it will work with words like "f?r" and "for?" (the second one is a correct one with a ? in the end of the word). The second problem i fear from - running times. For each corrupted word i need to search in the whole dictionary all the combinations of specials german chars... – Alex May 27 '14 at 12:06
  • For the first problem - no, it does not require the amount of corrupted chars to be equal to the amount of special german chars. As for your example with "for?" - in the helper function I should add an option to not modify the string if the special char is in the end of the string, thus it will check what happens when we replace the corrupted char in a special one, and what if we just erase it, so long as this is the last char in the word. For the second problem, the runtime isn't THAT bad because you just need to look at certain words that have a common starting sub-string. –  May 27 '14 at 12:18
  • It's still a problem, but WAY better than checking the entire dictionary. –  May 27 '14 at 12:19
  • 1
    How your algo. solves the fact that in the original word there were 3 letters and in the corrupted i can have 5-6 letters? – Alex May 27 '14 at 12:49