regex that can handle horribly misspelled words

Question

Is there a way to create a regex will insure that five out of eight characters are present in order in a given character range (like 20 chars for example)?

I am dealing with horrible OCR/scanning, and I can stand the false positives.

Is there a way to do this?

Update: I want to match for example "mshpeln" as misspelling. I do not want to do OCR. The OCR job has been done, but is has been done poorly (i.e. it originally said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln" it could be "mispel" or any number of other combinations).

I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.

Can you define more strict rules what do you expect on input and output? That said, if useful method of this nature (regex-based) really existed and improved OCR recognition, I am sure that OCR folks would have jumped on it and used it across all OCR software by now. — mvp, Jun 04 '13 at 03:39
If you already know which characters are going to be in the word processed with OCR/scanning, there is something to do, but I don't think it'd be relevant to use OCR in that case. — Frederik.L, Jun 04 '13 at 03:42
@Frederik.L It seems Soandos will run whole dictionary to 'guess' the word with whatever he can read in each word. — Yogee, Jun 04 '13 at 03:44
@Cyclone Regular expressions are best at handling well defined grammars and not needing to count. This question defies both. — Patashu, Jun 04 '13 at 03:56
Can you elaborate a bit more on your question? Are you only looking to find mispelling in your string, or how many different dictionary words are you looking for to try and match? — Patashu, Jun 04 '13 at 04:11
@Patashu I only want to look for one at a time (i.e. I only want to match a piece of text with one word (in the above example, misspelling). If it matches misspelling, perfect, if not, give up. — soandos, Jun 04 '13 at 04:13
@soandos Perfect, that makes it pretty easy (see my answer). — Patashu, Jun 04 '13 at 04:17
@soandos Maybe the "Regex" could be removed from the question and instead just focus on the main problem (or maybe make it a suggestion instead of a requirement). That might spawn more answers that would solve your issue. — default, Jun 04 '13 at 06:27

score 3 · Answer 1 · answered Jun 04 '13 at 03:55

3

I think you need not regex, but database with all valid words and creative usage of functions like soundex() and/or levenshtein().

You can do this: create table with all valid words (dictionary), populate it with columns like word and snd (computed as soundex(word)), create indexes for both word and snd columns. For example, for word mispeling you would fill snd as M214. If you use SQLite, it has soundex() implemented by default.

Now, when you get new bad word, compute soundex() for it and look it up in your indexed table. For example, for word mshpeln it would be soundex('mshpeln') = M214. There you go, this way you can get back correct word.

But this would not look anything like regex - sorry.

answered Jun 04 '13 at 03:55

mvp

111,019
13
122
148

I do not want to use levenshtein distance, as a distance of less than x does not mean that it has all the letters that I care about (the range is going to be large, and the distance is expected to be large (~10 or so)). I cannot create a dictionary of all the possibilities as there are too many (trillions) – soandos Jun 04 '13 at 03:58
Your best bet, then, is to map to the "closest" valid word - how that is precisely defined is up to you, ultimately. Try to inject rules about common OCR messups, like rn == m and others. – Jun 04 '13 at 04:02

score 0 · Answer 2 · answered Jun 04 '13 at 04:06

0

To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is to large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk where you can outsource to work for pennies per solution.

answered Jun 04 '13 at 04:06

FireCrakcer37

981
10
18

That is not really an acceptable solution. I need something can work at scale, and it seems to be a rather defined problem (even if it is not suitable for regex) – soandos Jun 04 '13 at 04:07

score 0 · Answer 3 · answered Jun 04 '13 at 04:16

This can't be done with a regex, but it can be done with a custom algorithm.

For example, to find words that are like 'misspelling' in your body of text:

1) Preprocess. Create a Set (in the mathematical sense, collection of guaranteed to be unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}

2) Split the body of text into words.

3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.

If the OCR can add in erroneous spaces, then consider two words at a time instead of single words. And etc based on what your requirements are.

Is there a way to preserve the ordering? gnillepsssim is clearly not a match for example — soandos, Jun 04 '13 at 04:18

Frederik.L · Answer 4 · 2013-06-04T05:59:04.547

0

I have no solution for this problem, in fact, here's exactly the opposite.

Correcting OCR errors is not programmaticaly possible for two reasons:

You cannot quantify the error that was made by the OCR algorithm as it can goes between 0 and 100%
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.

Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?

Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.

UPDATE:

After digging a little, I found a reference that may be relevant: String similarity measures

edited Jun 04 '13 at 05:59

answered Jun 04 '13 at 04:41

Frederik.L

5,522
2
29
41

I am not trying to correct the OCR. Instead, I am trying to determine if some string is the terminator of a field. I have determined the error rates that I am comfortable with, I just don't have a way to apply that. – soandos Jun 04 '13 at 04:43
@soandos you could use the Levenshtein distance on the field terminator candidates and the actual field terminator word. See which word has the lowest distance _or_ which field is under a certain boundary. – AutomatedChaos Jun 04 '13 at 05:39
@soandos You may take a look at the reference I found. There's some heavy algorithm to implement, but it shows the pros and cons of each strategies. – Frederik.L Jun 04 '13 at 06:02

regex that can handle horribly misspelled words

4 Answers4