Would elasticsearch or RavenDB be better for fueling a statistics engine/random forest?

Question

I've been looking at the following NoSQL databases for the next phase of my project:

elasticsearch positions itself as primarily serving advanced search scenarios, while RavenDB positions itself as a document-oriented database.

The documents will primarily be about videos. Each video has a natural id, which will be the key of its document.

Around that, I add other content in fields which will not necessarily be scalar or flat, as the information will come from a number of different sources with different structures.

For example, there will be content from the video provider's Atom feeds, blog posts that have the video embedded in it, and other pieces of data from a data warehouse project.

There is no set structure across all of the items (each of them will be very domain-specific, actually); the only thing that will relate them is the natural key of the video mentioned above.
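To make that shape concrete, here is a minimal sketch in Python of one document per video, keyed by the natural id, with each source's payload kept under its own field. The names `merge_source`, `sources`, `atom_feed` and `blog_posts` are hypothetical, and the in-memory dict only stands in for whichever store ends up being used.

```python
from typing import Any, Dict

# Hypothetical in-memory stand-in for the document store: video id -> document.
Store = Dict[str, Dict[str, Any]]


def merge_source(store: Store, video_id: str, source: str, payload: Any) -> Dict[str, Any]:
    """Attach one source's payload to the document keyed by the video's natural id."""
    # Create the document on first sight of this video id.
    doc = store.setdefault(video_id, {"id": video_id, "sources": {}})
    # Each source keeps its own structure; nothing is forced into a shared schema.
    doc["sources"][source] = payload
    return doc


store: Store = {}
merge_source(store, "abc123", "atom_feed", {"title": "Example video", "views": 10})
merge_source(store, "abc123", "blog_posts", [{"url": "http://example.com/post"}])
```

Both payloads land in the same document because the only thing they share is the video id.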

That said, once I have this information in one of the above solutions, I'll want to do a number of things with it:

Cull it to help populate variables in a random forest in order to make classifications about the videos
Provide general search on the videos (general free-text, not based on the results of the random forest) through a web-based front end (ASP.NET MVC if you must know)
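For the first item, "culling" could look roughly like the sketch below: walk a nested document and emit flat, dotted variable names that a random forest implementation could consume. This is only the preprocessing step, not the forest itself. The function name `flatten` and the choice to reduce a list to its length are assumptions made for illustration.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Turn a nested document into a flat mapping of dotted names to values."""
    features: Dict[str, Any] = {}
    if isinstance(value, dict):
        # Recurse into nested objects, extending the dotted prefix.
        for key, child in value.items():
            features.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        # Assumption: a list is summarised by its length only.
        features[prefix + "count"] = len(value)
    else:
        # Scalar leaf: drop the trailing dot from the prefix.
        features[prefix.rstrip(".")] = value
    return features


example = {
    "id": "v1",
    "sources": {"atom_feed": {"views": 10}, "blog_posts": [{"url": "a"}, {"url": "b"}]},
}
variables = flatten(example)
```

Here `variables` contains the keys `id`, `sources.atom_feed.views` and `sources.blog_posts.count`.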

There are some requirements:

I will more than likely be in an ASP.NET shared web hosting environment. This means I'll have one machine, and won't have access to set up a service. Something embeddable will be very helpful.
The ASP.NET environment will be hosted in IIS, so the embeddable aspect will have to survive app-domain recycling.
I'll want to create new indexes based on the results of the statistical analysis, which I can easily facet on to help with the search on the site.
Support for autocomplete functionality (I know this isn't an "out-of-the-box" request, but being able to get to that point is important).
Rich synonym support (there are a number of them in the type of videos I'm indexing content around)
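As a rough illustration of the synonym and autocomplete requirements, the sketch below builds an elasticsearch index-settings body as a plain Python dict. The `synonym` and `edge_ngram` token filters are elasticsearch analysis features, but exact names and options vary between versions, so treat this as an assumption to check against the version in use. The filter and analyzer names (`video_synonyms`, `autocomplete_ngrams`, `video_text`, `video_autocomplete`) and the sample synonyms are made up.

```python
import json
from typing import Any, Dict, Iterable


def build_index_settings(synonyms: Iterable[str]) -> Dict[str, Any]:
    """Build an index-settings body with a synonym analyzer and an autocomplete analyzer."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Each entry is a comma-separated group of equivalent terms.
                    "video_synonyms": {"type": "synonym", "synonyms": list(synonyms)},
                    # Prefix n-grams so partial input can match at query time.
                    "autocomplete_ngrams": {"type": "edge_ngram", "min_gram": 2, "max_gram": 15},
                },
                "analyzer": {
                    "video_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "video_synonyms"],
                    },
                    "video_autocomplete": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "autocomplete_ngrams"],
                    },
                },
            }
        }
    }


settings = build_index_settings(["tv, television", "clip, video"])
body = json.dumps(settings, indent=2)  # JSON that could be sent when creating the index
```

The body would be supplied when the index is created; nothing here talks to a server.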

I'm also open to services, such as Truffler, although I do have concerns about the cost (and in Truffler's case, a little concerned about latency between the data centers, because the requests will come from the web host on the West coast, or from a back-end process on the East coast).

Additionally, I don't feel that one solution needs to fit all the requirements. I'm more than fine with having one serve one purpose and having another serve another purpose. Granted, migrations suck, but migrating between these two document stores is a little easier (and I don't expect them to use the same document structure, necessarily).

Trying to determine if this passes the [good subjective/bad subjective](http://blog.stackoverflow.com/2010/09/good-subjective-bad-subjective/) test, but it's making my head hurt. Since you already have an answer, let's just play it safe. – , Jan 31 '12 at 16:03
For other readers, an alternative (and interesting) answer for the same question can be found at: http://dba.stackexchange.com/questions/8101/would-elasticsearch-or-ravendb-be-better-for-fueling-a-statistics-engine-random — Ciprian Teiosanu, Apr 14 '14 at 12:01

Andy · Accepted Answer · 2011-12-09T04:03:52.003

2

I want to preface this by saying I'm more familiar with Elastic Search, so I might be biased. I think RavenDB looks cool, and could probably fit some of your needs well.

Here is why I would vote for Elastic Search.

I think your general search, faceting, and synonym support will be easier and more powerful in Elastic Search. Elastic Search leverages many of the search features from Lucene (e.g. stemming, phonetic matching, etc.)
Elastic Search has better real-time search capabilities. I couldn't exactly figure out if this is a strong need of yours, but hey, why not have better real-time search? Shay explains this very well at Berlin Buzzwords this year.
With Elastic Search, you can start with your one server and scale to many very easily. It was built with the cloud in mind from the start.

There is an Elastic Search .Net API. I'd love to hear what you decided, and how it worked out.

edited Dec 09 '11 at 04:03

answered Dec 09 '11 at 03:50

Andy


I'm aware of the distributed nature of elasticsearch, as well as the other points you mentioned. Definitely a +1 and an accept though because of the ability to boost separate fields; RavenDB doesn't support it unless you break out and use the Lucene query API, which means there is no support for boosting at index-time. Regarding the .NET clients, I've not seen a client that really captures all that elasticsearch offers, so I might use HTTP/JSON for what I need; the operations are part of a batch process, so real-time isn't *that* important but it's definitely a nice-to-have. – casperOne Dec 09 '11 at 04:19
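For readers wondering what boosting separate fields over plain HTTP/JSON might look like, here is a minimal sketch that only builds the request body. It assumes the `multi_match` query with the `field^boost` syntax, which should be checked against the elasticsearch version in use; the function name `build_boosted_query` and the field names are hypothetical.

```python
import json
from typing import Any, Dict, Mapping


def build_boosted_query(text: str, field_boosts: Mapping[str, float]) -> Dict[str, Any]:
    """Build a search body that weights each field by its boost factor."""
    # "title^3" means a match in title counts three times as much.
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


query = build_boosted_query("funny cats", {"title": 3, "description": 1})
payload = json.dumps(query)  # body for a POST to the index's _search endpoint
```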
1

Just to highlight a few points on RavenDB: it supports faceting and most of what you'd expect from an FTS engine out of the box. RavenDB also supports boosting at indexing time, both at document level and field level. RavenDB has a very good scale-out story. Not sure what makes Andy say ES has better RT search, but I agree ES is easier to scale out thanks to its cluster notion and discovery capabilities. For comparison's sake, the strongest point of RavenDB is its client API. – synhershko Oct 23 '12 at 11:53
