Best way to store user-submitted item names (and their synonyms)

Question

Consider an e-commerce application with multiple stores. Each store owner can edit the item catalog of his store.

My current database schema is as follows:

item_names: id | name | description | picture | common(BOOL)
items: id | item_name_id | picture | price | description | picture
item_synonyms: id | item_name_id | name | error(BOOL)

Notes: error indicates a wrong spelling (eg. "Ericson"). description and picture of the item_names table are "globals" that can optionally be overridden by "local" description and picture fields of the items table (in case the store owner wants to supply a different picture for an item). common helps separate unique item names ("Jimmy Joe's Cheese Pizza" from "Cheese Pizza")

I think the bright side of this schema is:

Optimized searching & Handling Synonyms: I can query the item_names & item_synonyms tables using name LIKE %QUERY% and obtain the list of item_name_ids that need to be joined with the items table. (Examples of synonyms: "Sony Ericsson", "Sony Ericson", "X10", "X 10")

Autocompletion: Again, a simple query to the item_names table. I can avoid the usage of DISTINCT and it minimizes number of variations ("Sony Ericsson Xperia™ X10", "Sony Ericsson - Xperia X10", "Xperia X10, Sony Ericsson")

The down side would be:

Overhead: When inserting an item, I query item_names to see if this name already exists. If not, I create a new entry. When deleting an item, I count the number of entries with the same name. If this is the only item with that name, I delete the entry from the item_names table (just to keep things clean; accounts for possible erroneous submissions). And updating is the combination of both.

Weird Item Names: Store owners sometimes use sentences like "Harry Potter 1, 2 Books + CDs + Magic Hat". There's something off about having so much overhead to accommodate cases like this. This would perhaps be the prime reason I'm tempted to go for a schema like this:

items: id | name | picture | price | description | picture

(... with item_names and item_synonyms as utility tables that I could query)

Is there a better schema you would suggested?
Should item names be normalized for autocomplete? Is this probably what Facebook does for "School", "City" entries?
Is the first schema or the second better/optimal for search?

Thanks in advance!

References: (1) Is normalizing a person's name going too far?, (2) Avoiding DISTINCT

EDIT: In the event of 2 items being entered with similar names, an Admin who sees this simply clicks "Make Synonym" which will convert one of the names into the synonym of the other. I don't require a way to automatically detect if an entered name is the synonym of the other. I'm hoping the autocomplete will take care of 95% of such cases. As the table set increases in size, the need to "Make Synonym" will decrease. Hope that clears the confusion.

UPDATE: To those who would like to know what I went ahead with... I've gone with the second schema but removed the item_names and item_synonyms tables in hopes that Solr will provide me with the ability to perform all the remaining tasks I need:

items: id | name | picture | price | description | picture

Thanks everyone for the help!

Started a bounty. Hoping for more answers from all ye DB Gurus out there. — RabidFire, Jan 06 '11 at 13:02
I think the problem is that we're unclear of your REQUIREMENTS. I'm going to suggest what I think is happening. You're equivalent to Amazon. More than one Seller could offer {Nike Air Jordon Red/White 10.5US}. But they can all call them by different names so you have a normalization problem. These aren't SKU'ed items that do have a universal PK. So you're trying to derive that two things are really the same thing by a comparison of characters in the name? And you think this is an issue of the right schema? I don't get it. — Stephanie Page, Jan 06 '11 at 18:47
My requirements would be "Optimized searching", "Handling Synonyms" and "Autocomplete". A user tries to enter an item from a Text Field. Autocomplete tries to prevent too many variations of the same item name. Yes, it is a design issue. I'm looking for a better perspective on choosing the second schema over the first. — RabidFire, Jan 07 '11 at 01:36

score 2 · Accepted Answer · answered Jan 12 '11 at 22:04

The requirements you state in your comment ("Optimized searching", "Handling Synonyms" and "Autocomplete") are not things that are generally associated with an RDBMS. It sounds like what you're trying to solve is a searching problem, not a data storage and normalization problem. You might want to start looking at some search architectures like Solr

Excerpted from the solr feature list:

Faceted Searching based on unique field values, explicit queries, or date ranges

Spelling suggestions for user queries

More Like This suggestions for given document

Auto-suggest functionality

Performance Optimizations

Nice! I took a look at Solr and its features. It seems extremely powerful (especially its Text Analysis) and describes exactly what I'm looking for. Thanks. Bounty awarded. — RabidFire, Jan 13 '11 at 03:21

score 1 · Answer 2 · answered Jan 06 '11 at 19:11

1

If there were more attributes exposed for mapping, I would suggest using a fast search index system. No need to set aliases up as the records are added, the attributes simply get indexed and each search issued returns matches with a relevance score. Take the top X% as valid matches and display those.

Creating and storing aliases seems like a brute-force, labor intensive approach that probably won't be able to adjust to the needs of your users.

answered Jan 06 '11 at 19:11

ScottCher

14,651
6
26
25

I'm assuming you're asking me to remove storing synonyms (aliases)? How would I return search results for "yogurt", "yoghurt", or "yogourt"? http://en.wikipedia.org/wiki/Yoghurt I'm assuming that it will be labor intensive in the beginning. But as the number of items grows, people will mostly be adding previously existing items thanks to Autocomplete. I think the Facebook autocomplete for College Name is a nice example of this. – RabidFire Jan 07 '11 at 01:56
There are indexing systems out there that use fuzzy logic to find matches. Sounds like or similar to, kinds of searching, for example. Not much of an answer, I agree, since it doesn't offer a specific technology - I was just hoping to steer you in a different direction and give you more options. – ScottCher Jan 10 '11 at 21:09
Thanks for the help. Upvoted as it got me thinking about the second schema a little more. I think I'll leave all the heavy lifting to Solr though (based off the answer by another poster). – RabidFire Jan 13 '11 at 03:23

score 0 · Answer 3 · answered Jan 04 '11 at 06:48

0

Just an idea.

One thing that comes to my mind is sorting the characters in the name and synonym throwing away all white space. This is similar to the solution of finding all anagrams for a word. The end result is ability to quickly find similar entries. As you pointed out, all synonyms should converge into one single term, or name. The search is performed against synonyms using again sorted input string.

answered Jan 04 '11 at 06:48

Schultz9999

8,717
8
48
87

That's a nice way of storing anagrams, where words are *synonymous* to each other if the sorted characters with white space removed are the same. But I don't think I want to return "tops" when a user searches for "pots". :) – RabidFire Jan 04 '11 at 07:49

Best way to store user-submitted item names (and their synonyms)

3 Answers3