5

I need an algorithm that can recognize words (dictionary based) in a sequence of characters that has no white spaces.

lets say for example, the sequence is:
spaceless
it should recognize space and less.

and there might be situations where more words can be recognized. its hard to give such an example but I'll give it a try:

example: spaceslight
recognized words: space and slight (1)
recognized words: spaces and light (2)

so the algorithm should be able to find those kind of variations too.

Justin Morgan - On strike
  • 30,035
  • 12
  • 80
  • 104
Antagonist
  • 612
  • 4
  • 17
  • You're question is similar to this one :http://stackoverflow.com/questions/6766346/algorithm-to-parse-string-with-dictionary/6767145#6767145 I think – Ricky Bobby Aug 09 '11 at 09:34
  • have't tried anything yet , but the first thing that comes to my mind is to fetch all the words from the dictionary and try to match them against the string sequence . but it looks like it will be terribly slow. – Antagonist Aug 09 '11 at 09:36
  • tnx Ricky , that should help , – Antagonist Aug 09 '11 at 09:39
  • 1
    [This](http://thenoisychannel.com/2011/08/08/retiring-a-great-interview-problem/) might help, just saw it yesterday. – rafalio Aug 09 '11 at 09:39
  • @Mitch Wheat: what's wrong with asking for an algorithm? Why he should try 'something' if he doesn't know at all where and how to start? By my opinion these kind of questions are much better than please-debug-my-code questions. – Tomas Aug 09 '11 at 10:30
  • 1
    @Tomas Telensky: I was under the impression that the reason homework was given out was to get the student to attempt it. – Mitch Wheat Aug 09 '11 at 11:36
  • @Mitch Wheat: 1) the homework tag was added by me, this is how I understand the typical "homework" question 2) see my point here: http://meta.stackexchange.com/questions/101473/whats-wrong-with-asking-for-an-algorithm-much-better-than-please-debug-my-code – Tomas Aug 09 '11 at 11:45
  • @Tomas - I don't think it's at all certain that this is homework. I can easily see this question popping up as a business problem, and finding a fast algorithm for this is going to be vital. I honestly don't see a problem with this question. – Justin Morgan - On strike Aug 09 '11 at 13:59
  • @Justin Morgan: 1) I also don't see any problem with this question!!! I don't understard at all why was it closed, see my Q on meta: http://meta.stackexchange.com/questions/101473/whats-wrong-with-asking-for-an-algorithm-much-better-than-please-debug-my-code 2) Maybe I misunderstood the homework tag - I understood it as it is homework for us! :-) – Tomas Aug 09 '11 at 14:16
  • It should have been closed as a duplicate, not as "not a real question". – Fantius Aug 09 '11 at 19:53
  • @Tomas - [tag:homework] means the asker is requesting help on his school homework. This lets other people know that they should guide the student to solve the problem himself, rather than simply handing him the answer. Questions should only be tagged as homework when that's known to be the case, such as when the asker has said so, or when it's blindingly obvious, as in [this question](http://stackoverflow.com/q/1694971/399649). – Justin Morgan - On strike Aug 11 '11 at 05:28
  • @Justin Morgan: thanks, this is much better description - I've put it to tag wiki, if you don't mind. – Tomas Aug 11 '11 at 10:42

3 Answers3

1

If you need multiple queries on the same string a suffix trie is a good solution. This will store the string very efficiently and allows lookup of queries in O(n) where n is the length of the query (note that you cannot do better unless you have more knowledge of the queries).

If a suffix trie still is using up too much space, you can use a DAWG, but this is much more complicated to build.

LiKao
  • 10,408
  • 6
  • 53
  • 91
1

You can also try the Knuth-Morris-Pratt algorithm. It searches for strings in text... If I remember it correctly it has a linear complexity. Here have a look:

http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

PS: You might need to tweak it a little bit to your needs...

Fred
  • 16,367
  • 6
  • 50
  • 65
1

You might want to look at the Rabin-Karp algorithm, it allows a single pass through the text file to search for all the n letter words in the dictionary for some value of n. Standard Rabin-Karp will find overlaps: spaceslight -> spaces, a, ace, aces, slight, light, i. You would need to modify it if you didn't want overlapping words.

rossum
  • 15,344
  • 1
  • 24
  • 38