What is the best data mining method for vehicle search?

Question

I'm trying to build a search engine that goes through online vehicle classifieds such as Oodle, eBay motors, and craigslist. I also have a large database of standard vehicle names and specifications about them. What I would like to do is for each record that I find through the classified site, be able to determine exactly what vehicle model, style it is (from my database). For example, a standard name for a ford truck in my db is: 2003 Ford F150.

However on classified sites, people might refer to is as: "2003 Ford F 150" or "2003 Ford f-150" or "03 Ford truck 150". Is there an effective data mining/text classification algorithm to be able to normalize these texts to the standard name above?

score 1 · Accepted Answer · answered Apr 23 '09 at 17:52

You could use the Levenshtein distance to match the found string against your database records.

Another (probably better) idea is to tokenize the strings and use a term vector model for the vehicle names. This way you can use cosine similarity to find relevant matches.

score 0 · Answer 2 · answered Dec 05 '13 at 16:28

If you're gonna develop a whole search engine intended to scale in both, usage and size, you will need something robust to support your queries.

If you're gonna used edit distance, Bed-trees provide a good alternative for your index structure. Another good approach, depending on the size of your dataset, is to use a Levenshtein automata. Levenshtein automatas are also great at providing auto-complete functionalities, which you may need since you're developing a search engine.

Another approach to edit distance is to use n-grams combined with Jaccard index. For this approach you can use Minhash + LSH. Also, you can use Jaccard as a distance metric (1 - Jaccard index) which respects the triangle inequality, thus, can be used in a metric tree such as a VP-tree.

One of these approaches will certainly help you.

What is the best data mining method for vehicle search?

2 Answers2