
Possible Duplicate:
Matching incorrectly spelt words with correct ones in python

I have to interpret an incoming SMS that looks something like these:

SHOP NAME : CITY

Annies pet shop new york

Budds Calerfonia

Kelvins Boat Shop San Fransico

Karel Boom West palm beach

I have a list of cities and a list of shop names that I have to compare the SMS against. If the shop name is there, great; if the city is there, perfect.

The problem is that people will misspell these. And because there is no separator such as a comma, how would I know where each field starts and stops?

I have looked at using the Levenshtein function, which returns the closest match in a list. But what if there is no match? Then I have to tell the user something like "Sorry, nothing matches your SMS."

How would you go about doing that? Bear in mind that each SMS campaign might have a different number of parameters.

Harry
  • This looks like [this question you asked six hours ago](http://stackoverflow.com/questions/11563615/matching-incorrectly-spelt-words-with-correct-ones-in-python) – inspectorG4dget Jul 19 '12 at 21:53
  • No, it's not; that did not answer this question. That was just to tell me how to actually look for words. – Harry Jul 19 '12 at 21:54

3 Answers


If the incoming SMS has a \n after each line, you could split it on that.
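As a minimal sketch of that idea, assuming the fields really do arrive on separate lines (the question does not confirm this), the split could look like the following. The function name `split_fields` is made up for illustration.

```python
def split_fields(sms):
    """Split an SMS on line breaks, dropping blank lines and surrounding spaces."""
    return [line.strip() for line in sms.splitlines() if line.strip()]


# Each non-empty line becomes one field.
fields = split_fields("Annies pet shop\nnew york\n")
```

`str.splitlines()` also handles `\r\n`, which may matter depending on the SMS gateway.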

Sumedh Sidhaye

If there is no match, you could check the SMS manually or automatically send an SMS back saying that the shop or city isn't recognized. If you recognize one of them, then you can add some rules to guess the other parameter. For example, if the city is recognized, see whether there is only one shop in that city and fill it in automatically. I would also suggest adding some kind of separator between the attributes, for example a comma: SHOP, CITY
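The "only one shop in that city" rule could be sketched roughly like this. The `SHOPS_BY_CITY` mapping, its entries (including the invented "Bay Bikes"), and the name `infer_shop` are all hypothetical; a real campaign would load its own data.

```python
# Hypothetical mapping from a recognized city to the shops registered there.
SHOPS_BY_CITY = {
    "New York": ["Annies Pet Shop"],
    "San Francisco": ["Kelvins Boat Shop", "Bay Bikes"],
}


def infer_shop(city, shops_by_city=SHOPS_BY_CITY):
    """Return the only shop in the city, or None if unknown or ambiguous."""
    shops = shops_by_city.get(city, [])
    return shops[0] if len(shops) == 1 else None
```

When `infer_shop` returns `None`, the SMS would fall back to the manual check or the "not recognized" reply described above.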

aphex

1) I think there is no way to fix all errors. You need to decide what kinds of mistakes you want to fix and what formats the data may use. Don't keep it too fuzzy: with very fuzzy matching you may accept junk as something valid, and it would be hard to understand decision paths and fix bugs.

2) There are several ways to do fuzzy matching. I would suggest reviewing this question: https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison

3) Replace all runs of spaces, line breaks and extra characters with a single space. That will make it easier to tokenize your text.
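The three points above can be sketched together using only the standard library. This is one possible approach, not a definitive one: it swaps in `difflib.SequenceMatcher` as the similarity measure instead of Levenshtein distance, and the `SHOPS` and `CITIES` lists, the function names, and the `0.75` cutoff are assumptions chosen for illustration. It normalizes the text, tries every split point between the shop part and the city part, and returns `None` when no split is close enough, which covers the "nothing matches" case.

```python
import difflib
import re

# Hypothetical reference data; a real campaign would supply its own lists.
SHOPS = ["Annies Pet Shop", "Budds", "Kelvins Boat Shop", "Karel Boom"]
CITIES = ["New York", "California", "San Francisco", "West Palm Beach"]


def normalize(text):
    """Lowercase, collapse every non-alphanumeric run to one space, and trim."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def best_match(candidate, choices):
    """Return (score, choice) for the entry in choices most similar to candidate."""
    return max(
        (difflib.SequenceMatcher(None, candidate, normalize(c)).ratio(), c)
        for c in choices
    )


def parse_sms(sms, shops=SHOPS, cities=CITIES, cutoff=0.75):
    """Try every shop/city split; return (shop, city) or None if nothing fits."""
    tokens = normalize(sms).split()
    best = None
    for i in range(1, len(tokens)):
        shop_score, shop = best_match(" ".join(tokens[:i]), shops)
        city_score, city = best_match(" ".join(tokens[i:]), cities)
        # Both halves must clear the cutoff, so junk is rejected rather than guessed.
        if min(shop_score, city_score) < cutoff:
            continue
        total = shop_score + city_score
        if best is None or total > best[0]:
            best = (total, shop, city)
    return best[1:] if best else None
```

Calling `parse_sms("Kelvins Boat Shop San Fransico")` picks the split after "shop" because that split gives the highest combined score. Raising `cutoff` makes the matcher stricter, in line with the advice in point 1 not to keep it too fuzzy. Campaigns with more than two parameters would need the split search extended to more fields.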

varela