I'm building an application that returns results based on a movie title entered by a user. If the user forgets to put spaces between the words of the title, is there a way I can still take the input and return the correct data? For example, "outofsight" should still be interpreted as "out of sight".
5 Answers
There is no regex that can do this reliably. You could try a search server like Solr.
Alternatively, you could offer auto-complete in the GUI (if you have one) as the user types, and so head off some of the common mistakes users make.
Example:
- User wants to search for "outofsight"
- Starts typing "out"
- Sees "out of sight" as suggestion
- Selects "out of sight" from suggestions
- ????
- PROFIT!!!
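
A rough sketch of the suggestion step, assuming the titles fit in memory. The `SUGGESTION_TITLES` list and the `suggest` helper are made up for illustration; a real application would query its own data store or search server instead.

```ruby
# Hypothetical in-memory title list, for illustration only.
SUGGESTION_TITLES = ["Out of Sight", "Out of Africa", "Outbreak", "Office Space"].freeze

# Return up to `limit` titles that start with what the user has typed so far,
# ignoring case and surrounding whitespace.
def suggest(prefix, titles = SUGGESTION_TITLES, limit: 5)
  needle = prefix.downcase.strip
  return [] if needle.empty?

  titles.select { |title| title.downcase.start_with?(needle) }.first(limit)
end

suggest("out") # => ["Out of Sight", "Out of Africa", "Outbreak"]
```

Prefix matching only helps while the user is still typing the first word; it does nothing for input that is already complete, such as "outofsight".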

There's no regex that can tell you where the word breaks were supposed to be. For example, if the input is "offlight", is it supposed to return "Off Light" or "Of Flight"?
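
To make the ambiguity concrete, here is a small sketch that lists every way to split a string into known words. The `WORDS` set and the `segmentations` helper are assumptions for the example; a real word list would come from a dictionary or from the stored titles.

```ruby
require "set"

# Tiny hypothetical word list, for illustration only.
WORDS = Set.new(%w[of off flight light]).freeze

# Return every way to split `text` into a sequence of words found in `words`.
def segmentations(text, words = WORDS)
  return [[]] if text.empty?

  (1..text.length).flat_map do |len|
    head = text[0, len]
    next [] unless words.include?(head)

    # Recurse on the remainder and prepend the word just matched.
    segmentations(text[len..-1], words).map { |rest| [head] + rest }
  end
end

segmentations("offlight").map { |parts| parts.join(" ") }
# => ["of flight", "off light"]
```

Both splits are valid, so something other than the string itself (the list of known titles, for instance) has to decide between them.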

This is impossible without a dictionary and some kind of fuzzy-search algorithm. For the latter, see "How can I do fuzzy substring matching in Ruby?".

You could take a string and put `\s*` in between each character.
So `outofsight` would be converted to:
`o\s*u\s*t\s*o\s*f\s*s\s*i\s*g\s*h\s*t`
... and match `out of sight`.
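
In Ruby that could look roughly like the following. The `spaced_pattern` helper name, the anchoring with `\A` and `\z`, and the case-insensitive flag are choices made for this sketch, not part of the suggestion itself.

```ruby
# Build a pattern from the user's input that allows optional whitespace
# between every pair of characters.
def spaced_pattern(input)
  # Drop any spaces the user did type, then escape each remaining character
  # so that input such as "m*a*s*h" is treated literally.
  chars = input.gsub(/\s+/, "").chars.map { |c| Regexp.escape(c) }
  Regexp.new('\A' + chars.join('\s*') + '\z', Regexp::IGNORECASE)
end

pattern = spaced_pattern("outofsight")
pattern.match?("Out of Sight")  # => true
pattern.match?("Out of Africa") # => false

# Hypothetical title list; grep keeps the entries the pattern matches.
["Out of Sight", "Out of Africa"].grep(pattern) # => ["Out of Sight"]
```

The pattern is built once per search from the user's input and then tested against each stored title.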

- Is this practical though? Every movie title would have to become such a regex, and then for every search, every regex would have to be tested! For 1,000 movie titles, that's 1,000 regexes to run! – Andrew Cheong Jun 18 '12 at 21:09
- @acheong87 This is an answer to `What Ruby Regex code can I use for obtaining “out of sight” from the input “outofsight”?` – iambriansreed Jun 18 '12 at 21:19
- I don't mean disrespect, but his question details a general purpose, and denying the validity of my question because you've answered a very specific case, simply isn't sound. Yes, your suggestion is very good for a regex that matches "out of sight" regardless the spacing. But his problem is to translate incorrect user data (outofsight) to correct stored data (out of sight), and your suggestion to convert the entire stored end to regexes, IMHO, is impractical. It does inspire an alternate solution though: simply strip all spaces on both ends. – Andrew Cheong Jun 18 '12 at 21:40
- @acheong87 it'd convert the user input, not the movie title, wouldn't it? – Andrew Grimm Jun 18 '12 at 23:11
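
The alternative raised in the comments, stripping all spaces on both ends, could be sketched like this. The `title_key` and `find_title` helpers and the `MOVIE_INDEX` hash are hypothetical names for the example; in practice the key would more likely live in an extra indexed database column.

```ruby
# Reduce a title or a query to a comparison key: lowercase, all whitespace removed.
def title_key(text)
  text.downcase.gsub(/\s+/, "")
end

# Hypothetical stored titles, indexed once by their key rather than on every search.
MOVIE_INDEX = ["Out of Sight", "Office Space"].map { |title| [title_key(title), title] }.to_h

# Look up the stored title whose key equals the key of the query.
def find_title(query, index = MOVIE_INDEX)
  index[title_key(query)]
end

find_title("outofsight")     # => "Out of Sight"
find_title("OUT OF  SIGHT")  # => "Out of Sight"
find_title("outofafrica")    # => nil
```

This turns each search into a single hash lookup, but two titles that differ only in spacing (the "Off Light" / "Of Flight" case) would share a key, so a real index would need to store a list of titles per key.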
You can't do this with regular expressions, unless you want to store one or more patterns to match for each movie record. That would be silly.
A better approach for catching minor misspellings would be to calculate Levenshtein distances between what the user is typing and your movie titles. However, when your list of movies is large, this will become a rather slow operation, so you're better off using a dedicated search engine like Lucene/Solr that excels at this sort of thing.
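
For a small list, a plain-Ruby version of the Levenshtein idea might look like this. The `levenshtein` and `closest_title` helpers are written for this sketch and are not taken from any gem; the linear scan in `closest_title` is exactly the part that gets slow as the list grows.

```ruby
# Classic dynamic-programming edit distance, keeping only two rows in memory.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1,            # deletion
               curr[j - 1] + 1,        # insertion
               prev[j - 1] + cost].min # substitution (or match)
    end
    prev = curr
  end
  prev.last
end

# Return the title with the smallest edit distance to the query (case-insensitive).
def closest_title(query, titles)
  titles.min_by { |title| levenshtein(query.downcase, title.downcase) }
end

levenshtein("outofsight", "out of sight")                      # => 2
closest_title("out of sihgt", ["Out of Sight", "Out of Africa"]) # => "Out of Sight"
```

A real application would probably also set a maximum distance, so that a query with no close match returns nothing instead of the least-bad title.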
