
I am working on a web app that collects data about mobile phones from several websites. The problem is that the websites name the same phone slightly differently. For example, these are the name variations I see for two phones:

HTC One X+ (Black); HTC One X+ Black; HTC One X Plus; HTC One X Plus, black

Samsung Galaxy S3 (Pebble Blue, with 16GB); Samsung Galaxy S III (Blue); Samsung Galaxy S3 I9300 16GB Pebble Blue; Samsung I9300 Galaxy S III (16 GB); Samsung Galaxy S3 (I9300), pebble blue

Since I read this data off the websites with a crawler, I need my program to resolve all these different strings to the same product.

Any ideas? If it matters, I am using python.

shreyj
  • A dictionary? The keys could be the various different strings, but they can all reference the same product. Perhaps you could use a defaultdict so that any unrecognised strings are still stored. – aychedee Mar 03 '13 at 18:35
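
One possible reading of the dictionary suggestion, as a minimal sketch: the `ALIASES` table and the `group_by_product` helper are hypothetical names, and the table entries are just a few of the strings from the question. Unrecognised strings are kept under their own raw name so nothing is lost.

```python
from collections import defaultdict

# Hypothetical, hand-maintained alias table: raw scraped string -> canonical product.
ALIASES = {
    "HTC One X+ (Black)": "HTC One X+",
    "HTC One X+ Black": "HTC One X+",
    "HTC One X Plus": "HTC One X+",
    "Samsung Galaxy S III (Blue)": "Samsung Galaxy S3",
    "Samsung I9300 Galaxy S III (16 GB)": "Samsung Galaxy S3",
}

def group_by_product(scraped_names):
    """Group raw names by canonical product; unknown names form their own group."""
    products = defaultdict(list)
    for raw in scraped_names:
        # Fall back to the raw string itself when no alias is known.
        products[ALIASES.get(raw, raw)].append(raw)
    return products
```

This only works for strings that have been seen before, so it would probably need to be combined with some form of fuzzy matching for new variations.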

2 Answers


You could use several approaches for this (and, for best results, combine them):

  1. Ignore everything that is in parentheses.
  2. Define words that you automatically drop, like "black", "blue" or "white".
  3. Compare the names via their Levenshtein distance and use this distance for clustering.
  4. Surface similarity (thanks to mbatchkarov)
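
The first three approaches can be combined roughly as follows. This is only a sketch under stated assumptions: `COLOUR_WORDS`, `normalize`, `levenshtein`, `cluster` and the `max_distance` threshold are illustrative names and values, not part of any library, and the greedy clustering is just one simple way to use the distance.

```python
import re

# Hypothetical list of words to drop automatically; extend as needed.
COLOUR_WORDS = {"black", "blue", "white", "pebble"}

def normalize(name):
    """Lowercase, drop parenthesised text, spell out '+', drop colour words."""
    name = name.lower()
    name = re.sub(r"\([^)]*\)", " ", name)   # approach 1: ignore parentheses
    name = name.replace("+", " plus ")
    tokens = re.findall(r"[a-z0-9]+", name)
    return " ".join(t for t in tokens if t not in COLOUR_WORDS)  # approach 2

def levenshtein(a, b):
    """Edit distance via the usual row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cluster(names, max_distance=2):
    """Approach 3: put each name in the first cluster whose key is close enough."""
    clusters = []  # list of (normalized key, [original names])
    for name in names:
        key = normalize(name)
        for rep, members in clusters:
            if levenshtein(key, rep) <= max_distance:
                members.append(name)
                break
        else:
            clusters.append((key, [name]))
    return clusters
```

With these rules all four HTC strings from the question normalize to the same key. Variations such as "S3" versus "S III" are further apart than the threshold used here, so they would need an extra rule (for example a small alias table) or a larger threshold.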
Hyperboreus

I'm sure the standard-library difflib module will help you a lot.
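
As a rough illustration, `difflib.get_close_matches` can map a scraped name to the closest entry in a list of known products. The `CANONICAL` list and the `best_match` helper are hypothetical, and the 0.6 cutoff is simply difflib's default, not a tuned value.

```python
import difflib

# Hypothetical list of canonical product names to match against.
CANONICAL = ["HTC One X+", "Samsung Galaxy S3"]

def best_match(scraped, cutoff=0.6):
    """Return the canonical name most similar to `scraped`, or None."""
    # Compare case-insensitively, then map back to the original spelling.
    lowered = {c.lower(): c for c in CANONICAL}
    hits = difflib.get_close_matches(scraped.lower(), list(lowered),
                                     n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None
```

Long suffixes (storage size, colour, model code) lower the similarity ratio, so stripping such words before matching, or lowering the cutoff, may be necessary in practice.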

eyquem