
I am trying to match words within a string, but I do not want to match words that are only part of another, longer word. That's a poor explanation, so on to the example!

Say I have the word pen and want to match it within a string:

01pennsylvania should not match, as pen is part of the word pennsylvania.

However, pensforsale should match, as pen isn't part of another word. I've been looking into NLTK but can't find what I'm looking for. Can anyone point me in the right direction? I know it would be impossible to do this for all word combinations, but cutting down the noise even marginally would be a great help.

Thanks in advance!

Daniel Pilch
  • What platform are you running on? – wnnmaw Mar 19 '14 at 16:40
  • You're talking about OS right? linux – Daniel Pilch Mar 19 '14 at 16:41
  • So you need to both parse space-less text into words *and* then figure out which *mean* "pen" as opposed to just containing it? Would "pencil" count? How about if an animal is "penned" in? – jonrsharpe Mar 19 '14 at 16:41
  • I don't understand why `01pennsylvania` should not match `pen`, but `pensforsale` should... – MattDMo Mar 19 '14 at 16:41
  • You seem to be looking for matching word boundaries. May I suggest you to look into a basic regex tutorial? – devnull Mar 19 '14 at 16:42
  • @DanielPilch your best bet is to find a spell check package and use its word dictionary. Unfortunately, the one I prefer ([pyEnchant](http://pythonhosted.org/pyenchant/)) isn't available for linux – wnnmaw Mar 19 '14 at 16:43
  • You seem to think that Python has a large dictionary containing every word in the English language and the sentences in which it could be used. Sadly, it doesn't. – anon582847382 Mar 19 '14 at 16:43

1 Answer


You might find "How to split text without spaces into list of words?" a helpful start: by first trying to split your "pensforsale" into a list of words, you could then check for likely variants, such as plurals.

This is going to be a very slow and error-prone way to go, though.
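
To make the idea concrete, here is a rough sketch of the split-then-check approach. Everything in it is an assumption for illustration: `VOCAB` is a tiny hand-made word list (a real one would come from a dictionary or word-frequency file), `segment` and `contains_word` are hypothetical helper names, and the scoring (fewest tokens, with a heavy penalty for characters that fit no known word) is a simplification, not the method from the linked question.

```python
# Hypothetical toy vocabulary; replace with a real word list in practice.
VOCAB = {"pen", "pens", "for", "sale", "pennsylvania"}

UNKNOWN_COST = 10  # assumed penalty for a character that fits no known word


def segment(text, vocab):
    """Split space-less text into tokens, preferring few known words."""
    max_len = max(len(w) for w in vocab)
    # best[i] holds (cost, tokens) for the cheapest split of text[:i].
    best = [(0, [])]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in vocab:
                cost = 1
            elif len(piece) == 1:
                cost = UNKNOWN_COST  # fall back to a single unknown character
            else:
                continue
            prev_cost, prev_tokens = best[j]
            candidates.append((prev_cost + cost, prev_tokens + [piece]))
        best.append(min(candidates, key=lambda c: c[0]))
    return best[-1][1]


def contains_word(text, word, vocab=VOCAB):
    """True if `word`, or its naive plural, is a token of the split text."""
    variants = {word, word + "s"}  # only the simplest variant check
    return any(token in variants for token in segment(text.lower(), vocab))


print(segment("pensforsale", VOCAB))
print(contains_word("pensforsale", "pen"))
print(contains_word("01pennsylvania", "pen"))
```

With this toy vocabulary, "pensforsale" splits into "pens", "for", "sale" and counts as a match, while "01pennsylvania" keeps "pennsylvania" whole and does not. The result depends entirely on the vocabulary: if "pennsylvania" were missing from it, the second string would be split differently and could match.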

Joel Burton