
What would be considered best practice when you need additional data about facet results?

For example, I need a friendly name, image, meta keywords, description, and more for product categories (when faceting on categories).

  • Include it in the document? (This can lead to lots of duplication.)
  • Introduce category as a new index in Solr (or fake it with a doctype=category field in Solr).
  • Use an RDBMS to look up the additional data with a SELECT WHERE IN (..category facet result ids..).

Thanks,

Remco

Remco Ros
  • Please quantify "lots of information". How many entities are you dealing with and how complex is your data model? – Jesvin Jose Jan 25 '12 at 17:16
  • It's not very complex. It's just more data than I want to index in Solr. For example, I have a product catalog indexed with a multi-valued category_id field, but a category is a first-class entity in the system, so I need category name/url/image/meta data, etc. too. – Remco Ros Jan 26 '12 at 13:09

4 Answers

  • Use a fast NoSQL DB that fits your data.

BTW, Lucene, which is Solr's underlying layer, is in fact also a NoSQL-type storage facility.

If I were you, I'd use MongoDB. That's the first DB that came to mind, since you need binary data and they practically invented BSON, which is now a widespread means of transferring binary data in a JSON-like fashion.

If your data structure is more graph-shaped (like a social network), check out Neo4j, which has blindingly fast graph traversal algorithms.

Marko Bonaci

A relational DB can reliably enforce the "category is a first-class entity" thing. You would need referential integrity: a product may not belong to a category that doesn't exist, and a deleted category must not leave its child categories lying around. A normalized RDB can enforce referential integrity through the schema. A NoSQL DB must rely on client-side code (which you must write) to enforce referential integrity.


Let's see how "a product's category must exist" and "a subcategory's parent must exist" are done:

RDB: The table that assigns categories to products (an m:n relation) must have foreign keys to both the product and the category, each with ON DELETE CASCADE. If a category is deleted, a product simply cannot have such a category. For a category that links to another category as its child, the relevant field also has an ON DELETE CASCADE. This means that if a parent is deleted, its children cannot exist. This entire method is declarative ("it is declared thus"): all complexities exist in the data, and we don't need no stinking code to do it for us. You can model a DB as naturally as you understand the real-world implications of the data.
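
As a rough illustration of the declarative approach described above, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (category, product, product_category, parent_id) and the sample rows are assumptions made up for this example, not taken from the question. Note that SQLite only enforces foreign keys after the PRAGMA shown below; other databases enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: names are illustrative only.
conn.executescript("""
CREATE TABLE category (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES category(id) ON DELETE CASCADE
);
CREATE TABLE product (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE product_category (
    product_id  INTEGER NOT NULL REFERENCES product(id)  ON DELETE CASCADE,
    category_id INTEGER NOT NULL REFERENCES category(id) ON DELETE CASCADE,
    PRIMARY KEY (product_id, category_id)
);
""")

conn.executemany(
    "INSERT INTO category (id, name, parent_id) VALUES (?, ?, ?)",
    [(1, "Shoes", None), (2, "Sneakers", 1), (3, "Books", None)],
)
conn.execute("INSERT INTO product (id, name) VALUES (10, 'Runner')")
conn.executemany(
    "INSERT INTO product_category (product_id, category_id) VALUES (?, ?)",
    [(10, 2), (10, 3)],
)

# "A product's category must exist": linking to a missing category is rejected.
try:
    conn.execute("INSERT INTO product_category VALUES (10, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Deleting the parent removes its child category and the links to it.
conn.execute("DELETE FROM category WHERE id = 1")
remaining_categories = [
    row[0] for row in conn.execute("SELECT id FROM category ORDER BY id")
]
remaining_links = conn.execute(
    "SELECT product_id, category_id FROM product_category"
).fetchall()
```

No application code walks the hierarchy here; the cascade rules in the schema do all the cleanup.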

Document store-type NoSQL: You need to write code to do everything. "A category is deleted" is a use case, and you need to find the products that have that category and update each one. You have to write code for each use case. The same goes for managing subcategories. The data model may be incredibly stupid, but its real-world implications must be modeled in the code. And it's tougher to reason about code and control flow than about data structures.
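
To make the contrast concrete, here is a minimal sketch of the "category is deleted" use case written by hand. Plain Python dicts stand in for the collections of a document store; the function name delete_category and the field names parent_id and category_ids are hypothetical, and a real store would need the equivalent queries and updates through its own client.

```python
def delete_category(categories, products, category_id):
    """Delete a category, all of its descendants, and every product reference to them."""
    doomed = {category_id}
    # Repeatedly collect children of already-doomed categories until none are left.
    changed = True
    while changed:
        changed = False
        for cid, category in categories.items():
            if category["parent_id"] in doomed and cid not in doomed:
                doomed.add(cid)
                changed = True
    # Remove the category documents themselves.
    for cid in doomed:
        categories.pop(cid, None)
    # Update each product that referenced a removed category.
    for product in products.values():
        product["category_ids"] = [
            cid for cid in product["category_ids"] if cid not in doomed
        ]
    return doomed


# Hypothetical sample data standing in for two collections.
categories = {
    1: {"name": "Shoes", "parent_id": None},
    2: {"name": "Sneakers", "parent_id": 1},
    3: {"name": "Books", "parent_id": None},
}
products = {10: {"name": "Runner", "category_ids": [2, 3]}}

removed = delete_category(categories, products, 1)
```

Every rule that the schema would otherwise declare (children go with the parent, products lose the reference) has to be spelled out in control flow like this, once per use case.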

Do you really have performance needs that require NoSQL databases?

So use an RDBMS to manage your data. Then use Solr's DataImportHandler or client-side code to insert/update denormalized entities for searching. If most requests to your site can be expressed as Solr queries, great!
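
The client-side variant of this denormalization step might look roughly like the sketch below. It only builds the documents; sending them to Solr is left out. The function name build_solr_docs and the row shapes are assumptions for illustration, with category_id as the multi-valued field mentioned in the question's comments.

```python
def build_solr_docs(products, product_categories):
    """Flatten normalized rows into one search document per product.

    products:           iterable of (product_id, name) rows
    product_categories: iterable of (product_id, category_id) rows
    """
    category_ids = {}
    for product_id, category_id in product_categories:
        category_ids.setdefault(product_id, []).append(category_id)
    return [
        # category_id is multi-valued: one entry per assigned category.
        {"id": product_id, "name": name, "category_id": category_ids.get(product_id, [])}
        for product_id, name in products
    ]


# Hypothetical rows, as they might come back from two SQL queries.
docs = build_solr_docs(
    products=[(10, "Runner"), (11, "Novel")],
    product_categories=[(10, 2), (10, 3), (11, 3)],
)
```

Only the ids needed for faceting go into the index; names, images, and meta data stay in the database.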


As for expressing hierarchical faceting in Solr, see 'Ways to do hierarchial faceting in Solr?'.

Jesvin Jose

I would think about two alternatives:

1.) Store the information with every document without indexing it (to keep the index as small as possible). The point is that I would not store the image inside Lucene/Solr, only a file pointer.

2.) Store the additional data in an RDBMS or a NoSQL store (like MongoDB) and look it up, as you wrote.

My favorite is the second one, because a database is the traditional and most optimized way of storing data. But in the end it depends on your system: keep in mind that you need time for connecting to the database, searching through the data, and sending the additional information back to the application. So it could be faster to store everything in Lucene.

A small performance test would probably be useful.

The Bndr
  • On the category page I am currently doing this: query the current category and its children using SQL; query the Solr product index, faceting by category_id; then intersect both on the unique id to construct a view model containing the categories and the counts from the facet. – Remco Ros Jan 26 '12 at 13:11
  • Another option would be to store category entities in the index and issue two Solr queries: one to get the category/subcategories out of the category index, and one to get the category facets out of the product index. Then intersect on the unique id. – Remco Ros Jan 26 '12 at 13:14
  • A problem with both (intersecting) approaches is that you cannot calculate the paging anymore. For example, for a product listing, I query Solr for products, then intersect with a database to check whether there is available stock. How would you handle paging in this case (because I don't want to fetch too much from the server in one go)? – Remco Ros Jan 26 '12 at 13:17
  • I went for the "Solr as a pure search/facet index and SQL as the data store" approach. – Remco Ros Jan 28 '12 at 21:38
  • @Remco Ros >How would you handle paging in this case< I would handle pagination in Solr using an offset. If Solr returns the PKs for your database, you don't have to fetch too much. But using the "fetch" feature of your database would be difficult in this configuration. – The Bndr Jan 31 '12 at 08:46
  • I first query Solr, with paging. It returns the PKs of the entities I need to fetch from the SQL DB. I then query the SQL DB using a WHERE IN (all PKs concatenated). – Remco Ros Feb 01 '12 at 15:26
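
The lookup step described in these comments (take the ids Solr returns, then fetch the matching rows with a WHERE IN) could be sketched as follows. This is only an illustration under assumptions: the category table, its columns, and the facet_counts dict are made up, and the dict merely stands in for a parsed Solr facet response. The ids are passed as bound parameters rather than concatenated into the SQL string.

```python
import sqlite3


def fetch_categories(conn, category_ids):
    """Fetch category rows for the given ids with a parameterized WHERE IN."""
    if not category_ids:
        return {}
    placeholders = ",".join("?" for _ in category_ids)
    rows = conn.execute(
        f"SELECT id, name, image_url FROM category WHERE id IN ({placeholders})",
        list(category_ids),
    ).fetchall()
    return {row[0]: {"name": row[1], "image_url": row[2]} for row in rows}


def build_view_model(conn, facet_counts):
    """Combine facet counts (category id -> count) with category details from SQL."""
    details = fetch_categories(conn, list(facet_counts))
    return [
        {"id": cid, "count": count, **details[cid]}
        for cid, count in facet_counts.items()
        if cid in details  # skip ids the database no longer knows about
    ]


# Hypothetical data: a tiny category table and a stand-in for Solr's facet result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT, image_url TEXT)")
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?)",
    [(2, "Sneakers", "/img/sneakers.png"), (3, "Books", "/img/books.png")],
)
facet_counts = {2: 14, 3: 5, 99: 1}  # 99 does not exist in the database

view_model = build_view_model(conn, facet_counts)
```

Because only the ids from the current facet result (or the current page) are looked up, the SQL query stays small no matter how large the index is.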

Maybe I am wrong, but if you are on Solr trunk you could benefit from Solr join support. This would allow you to index several entities with relations among them while enforcing conditions on both.
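
As a hedged sketch of what such a request might look like, the snippet below builds query parameters using Solr's join query parser syntax, `{!join from=... to=...}`. The field names (id, category_id, doctype, name) are assumptions based on the question, and the function join_query_params is hypothetical. Keep in mind that a join like this filters products by conditions on category documents; it does not copy category fields into the product results, so the extra category data would still need its own lookup.

```python
from urllib.parse import parse_qs, urlencode


def join_query_params(category_query, rows=10, start=0):
    """Build parameters for a product search restricted by a query on category documents.

    Assumes category documents carry their key in `id` and product documents
    reference it in a multi-valued `category_id` field (hypothetical names).
    """
    q = "{!join from=id to=category_id}" + category_query
    return {"q": q, "rows": rows, "start": start}


params = join_query_params("doctype:category AND name:shoes")
query_string = urlencode(params)  # ready to append to a select URL
```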

Persimmonium