0

I'm trying to build a search engine that crawls online vehicle classifieds such as Oodle, eBay Motors, and Craigslist. I also have a large database of standard vehicle names and their specifications. For each record I find on a classified site, I would like to determine exactly which vehicle model and style it corresponds to in my database. For example, a standard name for a Ford truck in my DB is: 2003 Ford F150.

However, on classified sites, people might refer to it as "2003 Ford F 150", "2003 Ford f-150", or "03 Ford truck 150". Is there an effective data mining/text classification algorithm that can normalize these texts to the standard name above?

Brian Tompsett - 汤莱恩
wsb3383

2 Answers

1

You could use the Levenshtein distance to match the found string against your database records.
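
As a rough illustration of the Levenshtein idea, here is a minimal pure-Python sketch. The `normalize` helper, the `best_match` function, and the three-entry `STANDARD_NAMES` list are assumptions made up for this example; a real database would need an index rather than a linear scan.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Lowercase and drop spaces/punctuation (an assumed preprocessing step)."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def best_match(listing: str, standard_names: list[str]) -> str:
    """Return the standard name with the smallest edit distance to the listing."""
    key = normalize(listing)
    return min(standard_names,
               key=lambda name: levenshtein(key, normalize(name)))


# Hypothetical stand-in for the vehicle database.
STANDARD_NAMES = ["2003 Ford F150", "2003 Ford Focus", "2003 Honda Civic"]

print(best_match("2003 Ford f-150", STANDARD_NAMES))
```

With this preprocessing, "2003 Ford F 150" and "2003 Ford f-150" both normalize to the same string as the standard name, so the distance is zero. A listing like "03 Ford truck 150" is further away, which is where the token-based approach tends to do better.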

Another (probably better) idea is to tokenize the strings and use a term vector model for the vehicle names. This way you can use cosine similarity to find relevant matches.
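
A minimal sketch of the term vector idea, assuming a simple tokenizer that splits letters from digits (so "F150" and "f-150" both become `f` and `150`) and raw term counts as weights. The `NAMES` list and the function names are hypothetical; a real system would likely add TF-IDF weighting.

```python
import math
import re
from collections import Counter

# Runs of letters or runs of digits; punctuation and spaces are separators.
TOKEN_RE = re.compile(r"[a-z]+|\d+")


def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(listing: str, names: list[str]) -> list[tuple[float, str]]:
    """Score every standard name against the listing, best first."""
    query = Counter(tokenize(listing))
    scored = [(cosine(query, Counter(tokenize(name))), name) for name in names]
    return sorted(scored, reverse=True)


# Hypothetical stand-in for the vehicle database.
NAMES = ["2003 Ford F150", "2003 Ford Focus", "2003 Honda Civic"]

for score, name in rank("03 Ford truck 150", NAMES):
    print(f"{score:.3f}  {name}")
```

Because word order and extra words such as "truck" only dilute the score rather than break the match, this handles "03 Ford truck 150" more gracefully than a plain character-level edit distance.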

Pankrat
0

If you're going to develop a whole search engine intended to scale in both usage and size, you will need something robust to support your queries.

If you're going to use edit distance, Bed-trees provide a good alternative for your index structure. Another good approach, depending on the size of your dataset, is to use a Levenshtein automaton. Levenshtein automata are also great at providing auto-complete functionality, which you may need since you're developing a search engine.
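
To give a feel for indexed edit-distance search, here is a small sketch that walks a trie while carrying one row of the edit-distance table and prunes branches that can no longer stay within the bound. This is neither a Bed-tree nor a true Levenshtein automaton, only a simpler stand-in for the same idea; the class and function names and the sample keys are assumptions, and keys are assumed to be non-empty and already normalized.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set only on nodes that end a stored key


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def search(root, query, max_distance):
    """Return (word, distance) pairs within max_distance edits of query."""
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, max_distance, results)
    return results


def _walk(node, ch, query, previous_row, max_distance, results):
    # Extend the edit-distance table by one row for the trie character ch.
    current_row = [previous_row[0] + 1]
    for col in range(1, len(query) + 1):
        cost = 0 if query[col - 1] == ch else 1
        current_row.append(min(current_row[col - 1] + 1,
                               previous_row[col] + 1,
                               previous_row[col - 1] + cost))
    if node.word is not None and current_row[-1] <= max_distance:
        results.append((node.word, current_row[-1]))
    # Prune: if every cell exceeds the bound, no descendant can match.
    if min(current_row) <= max_distance:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, current_row, max_distance, results)


# Hypothetical normalized keys standing in for the database.
KEYS = ["2003fordf150", "2003fordfocus", "2003hondacivic"]
ROOT = build_trie(KEYS)

print(search(ROOT, "2003fordf15", 1))
```

Shared prefixes are scored once, so the cost grows with the part of the trie that stays within the bound rather than with the total number of keys.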

An alternative to edit distance is to use n-grams combined with the Jaccard index. For this approach you can use MinHash + LSH. You can also use Jaccard as a distance metric (1 - Jaccard index), which respects the triangle inequality and can therefore be used in a metric tree such as a VP-tree.
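
A minimal sketch of the n-gram plus Jaccard approach, computing exact Jaccard distance over character trigrams with a linear scan. MinHash + LSH would approximate the same similarity at scale and is not shown here. The `NAMES` list, the function names, and the choice of n = 3 are assumptions for illustration.

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    """Character n-grams of the text after dropping case and punctuation."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    if len(cleaned) < n:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def closest(listing: str, names: list[str], n: int = 3) -> str:
    """Return the standard name with the smallest Jaccard distance."""
    query = ngrams(listing, n)
    return min(names, key=lambda name: jaccard_distance(query, ngrams(name, n)))


# Hypothetical stand-in for the vehicle database.
NAMES = ["2003 Ford F150", "2003 Ford Focus", "2003 Honda Civic"]

print(closest("03 Ford truck 150", NAMES))
```

Since the distance is a true metric, the same `jaccard_distance` function could serve as the comparison function in a VP-tree instead of the linear scan used here.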

One of these approaches will certainly help you.

Felipe Martins Melo