58

My team is working with a third-party CMS that uses Solr as a search index. I've noticed that the authors seem to be using Solr as a database of sorts, in that each document returned contains two fields:

  1. The Solr document ID (basically a classname and database id)
  2. An XML representation of the entire object

So, basically, it runs a search against Solr, downloads the XML representation of the object, and then instantiates the object from the XML rather than looking it up in the database by ID.

My gut feeling tells me this is a bad practice. Solr is a search index, not a database... so it makes more sense to me to execute our complex searches against Solr, get the document ids, and then pull the corresponding rows out of the database.

Is the current implementation perfectly sound, or is there data to support the idea that this is ripe for refactoring?

EDIT: When I say "XML representation" - I mean one stored field that contains an XML string of all of the object's properties, not multiple stored fields.
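For illustration, here is a minimal sketch of the "Solr for IDs, database for rows" pattern described above. The table name, column names, and ID list are hypothetical, and an in-memory SQLite database stands in for both the real database and the search call (which would normally be an HTTP request to Solr):

```python
import sqlite3

def fetch_objects(search_ids, conn):
    """Given document IDs returned by a search index, pull the full rows
    from the canonical database instead of deserializing a stored blob."""
    placeholders = ",".join("?" for _ in search_ids)
    cur = conn.execute(
        f"SELECT id, title, body FROM documents WHERE id IN ({placeholders})",
        search_ids,
    )
    return {row[0]: {"title": row[1], "body": row[2]} for row in cur.fetchall()}

# Demo: an in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(1, "First", "alpha"), (2, "Second", "beta"), (3, "Third", "gamma")],
)

# Pretend the search index returned IDs 1 and 3 for some query.
objects = fetch_objects([1, 3], conn)
```

The search index stays small (IDs plus indexed fields only), and the database remains the single source of truth for object state.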

jayunit100
Michael Moussa

6 Answers

78

Yes, you can use SOLR as a database, but there are some really serious caveats:

  1. SOLR's most common access pattern, which is over HTTP, doesn't respond particularly well to batch querying. Furthermore, SOLR does NOT stream data, so you can't lazily iterate through millions of records at a time. This means you have to be very thoughtful when designing large-scale data access patterns with SOLR.

  2. Although SOLR performance scales horizontally (more machines, more cores, etc.) as well as vertically (more RAM, better machines, etc.), its querying capabilities are severely limited compared to those of a mature RDBMS. That said, there are some excellent functions, like the field stats queries, which are quite convenient.

  3. Developers who are used to relational databases will often run into problems when they apply the same DAO design patterns in a SOLR paradigm, because of the way SOLR uses filters in queries. There will be a learning curve in developing the right approach to building an application that uses SOLR for part of its large queries or stateful modifications.

  4. The "enterprisey" tools that allow for advanced session management and stateful entities, which many advanced web frameworks (Ruby, Hibernate, ...) offer, will have to be thrown completely out the window.

  5. Relational databases are meant to deal with complex data and relationships, and they are thus accompanied by state-of-the-art metrics and automated analysis tools. In SOLR, I've found myself writing such tools and stress-testing manually a lot, which can be a time sink.

  6. Joining: this is the big killer. Relational databases support methods for building and optimizing views and queries that join tuples based on simple predicates. In SOLR, there aren't any robust methods for joining data across indices.

  7. Resiliency: For high availability, SolrCloud can use a distributed file system underneath (e.g. HCFS). This model is quite different from that of a relational database, which usually handles resiliency with masters and slaves, RAID, and so on. So you have to be ready to provide the resiliency infrastructure SOLR requires if you want it to be cloud-scalable and resilient.
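On caveat 1, the lack of streaming can be worked around by paginating, as discussed in the comments below. A rough sketch of that idea, where `fetch_page` is a stand-in for a real page request (e.g. a Solr query with `start` and `rows` parameters; with Solr's cursorMark API you would pass a cursor instead of an offset to avoid deep-paging costs):

```python
def stream_results(fetch_page, page_size=100):
    """Lazily iterate over all results by repeatedly requesting pages.

    fetch_page(start, rows) stands in for one search request; it must
    return an empty list once the result set is exhausted. Note this
    assumes the index does not change between requests.
    """
    start = 0
    while True:
        page = fetch_page(start, page_size)
        if not page:
            return
        yield from page
        start += page_size

# Demo against an in-memory "index" of 250 documents.
index = list(range(250))

def fake_fetch(start, rows):
    return index[start:start + rows]

docs = list(stream_results(fake_fetch, page_size=100))
```

The consistency caveat is real: if documents are inserted or deleted between page requests, offset-based pagination can skip or duplicate results.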

That said, there are plenty of obvious advantages to SOLR for certain tasks (see http://wiki.apache.org/solr/WhyUseSolr): loose queries are much easier to run and return meaningful results. Indexing is done by default, so most arbitrary queries run pretty effectively (unlike an RDBMS, where you often have to optimize and de-normalize after the fact).

Conclusion: Even though you CAN use SOLR as an RDBMS, you may find (as I have) that there is ultimately "no free lunch": the cost savings of super-cool Lucene text searches and high-performance, in-memory indexing are often paid for by less flexibility and the adoption of new data access workflows.

jayunit100
  • 4
    Batch querying: simply send many HTTP requests concurrently. Streaming: you can trivially emulate this using pagination. Session management/stateful entities: this is only valid for transactional applications. Stress tests: use SolrMeter, no need to do it 'manually'. Joining: it's like that for most (all?) NoSQL databases. – Mauricio Scheffer Sep 03 '12 at 15:23
  • I disagree with the joining comment: in Mongo, for example, joining is easier, because the input can be indexed after the fact. The same goes for an RDBMS. Regarding pagination to mimic streaming, I think you would need to write some sophisticated code to do this, and it's still not clear that it would be consistent from request to request. – jayunit100 Sep 24 '12 at 20:45
  • thanks for replying. I'm not very familiar with MongoDB, but the documentation says "MongoDB does not support joins and so, at times, requires bit of denormalization" ( http://www.mongodb.org/display/DOCS/MongoDB+Data+Modeling+and+Rails ). Writing code to simulate streaming with pagination is trivial, at least in .NET (~15 LoC), though you're right that it assumes that the index doesn't change between requests. – Mauricio Scheffer Oct 19 '12 at 15:33
  • Yes, you are right: Mongo queries don't join collections, but data can be joined in many ways, using things like m/r or programmatically ... by adding indexes after the fact. – jayunit100 Oct 20 '12 at 02:13
  • 7
    Update: With SOLR 4: Some of this is not quite so accurate. Transactions are smarter, scale is easier/cheaper, and indexing is distributed. – jayunit100 Jan 30 '14 at 01:32
  • The original answer is over 10 years old, how much has changed with version 9 being the latest version? – ColinM Feb 11 '23 at 01:20
30

It's perfectly reasonable to use Solr as a database, depending on your application. In fact, that's pretty much what guardian.co.uk is doing.

It's definitely not bad practice per se. It's only bad if you use it the wrong way, just like any other tool at any level, even GOTOs.

When you say "An XML representation..." I assume you're talking about having multiple stored Solr fields and retrieving them using Solr's XML format, rather than one big XML-content field (which would be a terrible use of Solr). The fact that Solr uses XML as its default response format is largely irrelevant; you can also use a binary protocol, so in that regard it's quite comparable to traditional relational databases.

Ultimately, it's up to your application's needs. Solr is primarily a text search engine, but can also act as a NoSQL database for many applications.

theEpsilon
Mauricio Scheffer
  • We have several indexed fields, but only two are actually stored - the document ID and the document XML. So yes, it is effectively just one huge string of XML text that is used to instantiate the retrieved objects on the application side for all 1,000,000 of our indexed objects. – Michael Moussa Nov 23 '10 at 18:39
  • @Mike: IMO that's misusing Solr. Instead, define the corresponding fields in the Solr schema and index them properly. – Mauricio Scheffer Nov 23 '10 at 18:42
  • I am developing an e-commerce site in which I have multiple users and multiple types of products per user. Of course I need Solr for searching, but I can't decide whether to store each product in the database, related to its user, and index it in Solr, or just store it in Solr. I don't like the idea of having the same info stored twice, but it feels more consistent to have it in the database. What would you recommend? – Gonzalo May 04 '18 at 01:41
  • @Gonzalo for a general e-commerce app I'd recommend using a RDBMS as your source of truth. You will need that data normalised to do various queries/reports etc. – Mauricio Scheffer May 08 '18 at 16:07
  • Ok, so the better approach is to have the info stored twice and invest in more hardware – Gonzalo May 08 '18 at 16:10
2

This was probably done for performance reasons; if it doesn't cause any problems, I would leave it alone. There is a big grey area between what should live in a traditional database and what should live in a Solr index. I've seen people do similar things (usually with key-value pairs or JSON instead of XML) for UI presentation, only fetching the real object from the database when it's needed for updates/deletes. All reads just go to Solr.
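A rough sketch of that split, with all reads served from the denormalized index and all writes going through the database first. The class and field names are made up, and two in-memory dicts stand in for the database and the search index:

```python
class ReadThroughSearchStore:
    """Toy model of the pattern above: the database is the source of
    truth; every write is re-pushed to the index, and UI reads never
    touch the database."""

    def __init__(self):
        self.db = {}     # stand-in for the RDBMS (source of truth)
        self.index = {}  # stand-in for the denormalized search index

    def save(self, doc_id, record):
        self.db[doc_id] = record
        # Re-index after every write so reads stay in sync.
        self.index[doc_id] = {"id": doc_id, **record}

    def read(self, doc_id):
        # Reads come from the index only.
        return self.index.get(doc_id)

store = ReadThroughSearchStore()
store.save("product:1", {"name": "Widget", "price": 9.99})
```

In a real system the re-index step would be an update request to Solr, and the two stores can drift briefly between the write and the next commit.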

Joelio
  • The problem is performance... we have a 10GB core with only about 1,000,000 records. Searches are taking between 500ms and 2000ms (and happen often). I'm thinking it would be faster to search against a smaller core and pull the rows from the db (10-50ms tops). – Michael Moussa Nov 23 '10 at 17:07
  • 2
    @Mike: your index is too large, I'd look into sharding it: http://wiki.apache.org/solr/DistributedSearch – Mauricio Scheffer Nov 23 '10 at 18:22
2

I've seen similar things done because it allows for very fast lookup. We're moving data out of our Lucene indexes into a fast key-value store to follow DRY principles and also decrease the size of the index. There's not a hard-and-fast rule for this sort of thing.

Kent Murra
1

I had a similar idea: in my case, to store some simple JSON data in Solr, using Solr as a database. However, a BIG caveat that changed my mind was the Solr upgrade process.

Please see https://issues.apache.org/jira/browse/LUCENE-9127.

Apparently, in the past (pre-v6) the recommendation was to re-index documents after major version upgrades (not just run IndexUpdater), although you did not have to do this to maintain functionality (I cannot vouch for this myself; this is from what I have read). Now, if you have upgraded two major versions but did not re-index (actually, fully delete the docs and then the index files themselves) after the first major version upgrade, your core is no longer recognized.

Specifically in my case, I started with Solr v6. After upgrading to v7, I ran IndexUpdater, so the index was then at v7. After upgrading to v8, the core would not load. I had no idea why: my index was at v7, so that satisfies the version-minus-1 compatibility statement from Solr, right? Well, no. Wrong.

I did an experiment. I started fresh from v6.6, created a core, and added some documents. I upgraded to v7.7.3 and ran IndexUpdater, so the index for that core was now at v7.7.3. I upgraded to v8.6.0, after which the core would not load. Then I repeated the same steps, except that after running IndexUpdater I also re-indexed the documents. Same problem. Then I repeated everything again, except I did not just re-index: I deleted the docs from the index, deleted the index files, and then re-indexed. Now, when I arrived at v8.6.0, my core was there and everything was OK.

So, the takeaway for the OP or anyone else contemplating this idea (using Solr as db) is that you must EXPECT and PLAN to re-index your documents/data from time to time, meaning you must store them somewhere else anyway (a previous poster alluded to this idea), which sort of defeats the concept of a database. Unless of course your Solr core/index will be short-lived (not last more than one major version Solr upgrade), you never intend to upgrade Solr more than 1 version, or the Solr devs change this upgrade limitation. So, as an index for data stored elsewhere (and readily available for re-indexing when necessary), Solr is excellent. As a database for the data itself, it strongly "depends".
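That "re-index from somewhere else" step is straightforward if the canonical data lives outside Solr. As a minimal sketch (the row shape is hypothetical, and `index_doc` is a stand-in for posting one document to the index, e.g. an HTTP POST to Solr's update handler):

```python
def reindex(rows, index_doc):
    """Rebuild the search index from the canonical store.

    rows: iterable of records from the source of truth (e.g. a DB cursor).
    index_doc: callable that sends one document to the search index.
    Returns the number of documents indexed.
    """
    count = 0
    for row in rows:
        index_doc({"id": row["id"], "title": row["title"]})
        count += 1
    return count

# Demo with an in-memory "index" and two canonical rows.
index = {}
source_rows = [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}]
n = reindex(source_rows, lambda doc: index.__setitem__(doc["id"], doc))
```

If Solr is the only copy of the data, there is nothing to feed a loop like this, which is exactly the trap described above.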

ksl
0

Adding to @jayunit100's response: using Solr as a database, you get availability and partition tolerance at the cost of some consistency. There is a configurable lag between when you write something and when you can read it back.
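That lag comes from Solr's commit settings (e.g. the soft-commit interval). As a toy illustration only, with nothing resembling Solr's actual API and a hand-advanced fake clock to keep the demo deterministic, the visibility gap looks like this:

```python
class LaggyIndex:
    """Toy model of near-real-time search: a write becomes searchable
    only after the (configurable) commit interval has elapsed."""

    def __init__(self, clock, commit_lag=1.0):
        self.clock = clock            # injected so the demo is deterministic
        self.commit_lag = commit_lag  # stands in for a soft-commit interval
        self.pending = []             # (visible_at, doc) pairs

    def add(self, doc):
        self.pending.append((self.clock() + self.commit_lag, doc))

    def search(self):
        now = self.clock()
        return [doc for visible_at, doc in self.pending if visible_at <= now]

now = [0.0]  # fake clock we can advance by hand
idx = LaggyIndex(clock=lambda: now[0], commit_lag=1.0)
idx.add("hello")
before_commit = idx.search()  # the write is not yet visible
now[0] = 2.0                  # time passes beyond the commit interval
after_commit = idx.search()   # the write is now visible
```

Tightening the commit interval narrows the window but costs indexing throughput, which is the consistency trade-off this answer describes.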