
Let's say I've got a collection of 10 million documents that look something like this:

{
  "_id": "33393y33y63i6y3i63y63636",
  "Name": "Document23",
  "CreatedAt": "5/23/2006",
  "Tags": ["website", "shopping", "trust"],
  "Keywords": ["hair accessories", "fashion", "hair gel"],
  "ContactVia": ["email", "twitter", "phone"],
  "Body": "Our website is dedicated to making hair products that are..."
}

I would like to be able to query the database on any combination (including none) of the three array attributes Tags, Keywords, and ContactVia. I need to be able to combine conditions with ANDs (this document includes BOTH X and Y) or ORs (this document includes X OR Y).

Example queries:

  • Give me the first 10 documents that have the tags "website" and "shopping", with keywords matching "hair accessories" or "fashion", and with a ContactVia including "email".

  • Give me the second 20 documents that have the tags "website" or "trust", matching the keywords "hair gel" or "hair accessories".

  • Give me the 50 documents that have the tag "website".

I also need to order the results by other fields in the documents (score-type fields) or by created or updated dates. So there are basically four "ranges" that are queried regularly.
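For concreteness, the first two example queries above could be sketched as MongoDB query documents (plain Ruby hashes here; field names taken from the sample document, everything else illustrative). `$all` gives AND semantics within an array field, `$in` gives OR:

```ruby
# Sketch only: query documents for the example queries above.
# $all  => every listed value must be present (AND)
# $in   => any listed value may be present (OR)

# "tags website AND shopping, keywords hair accessories OR fashion,
#  contact via email" -- first 10
query1 = {
  'Tags'       => { '$all' => ['website', 'shopping'] },
  'Keywords'   => { '$in'  => ['hair accessories', 'fashion'] },
  'ContactVia' => 'email'  # matching a scalar against an array field tests membership
}

# "tags website OR trust, keywords hair gel OR hair accessories" -- docs 21..40
query2 = {
  'Tags'     => { '$in' => ['website', 'trust'] },
  'Keywords' => { '$in' => ['hair gel', 'hair accessories'] }
}

# Against a live collection these would run roughly as:
#   coll.find(query1, :sort => [['CreatedAt', -1]], :limit => 10)
#   coll.find(query2, :skip => 20, :limit => 20)
```

The query shapes themselves aren't the problem; the issue described below is whether Mongo can satisfy them (plus a sort) from an index instead of a collection scan.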

I started out SQL-based. Then I moved to Mongo because it supports arrays and hashes (which I love). But it doesn't support using an index across more than one range, so my Mongo database is slow: it can't use its indexes for these queries and has to scan 10 million documents.

Is there a better alternative? This is holding up moving this application into production (and the revenue that comes with it). Any thoughts as to the right database or alternative architectures would be greatly appreciated.

I'm in Ruby/Rails if that matters.

Williamf
    Once you start having queries like *site in tags, you might have to go with something like http://sphinxsearch.com/ – parapura rajkumar Jul 22 '12 at 16:52
  • I'm not sure why you need support for more than one range querying using an index. If the index is used for several query clauses, are the returned results still so large that they can't be sorted in memory efficiently? Is your data set large enough that you can't create multiple indexes corresponding to different queries? – Asya Kamsky Jul 22 '12 at 18:59
  • is the set of tags overlapping with the set of keywords? or are they non-overlapping sets? In other words, can a particular string be both a keyword and a tag? – Asya Kamsky Jul 22 '12 at 19:02
  • I think there is no reason why you couldn't use an SQL DB. For alternative schemas see here: http://stackoverflow.com/questions/597016/what-tag-schemas-are-the-most-efficient-effective .. and here: http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html – Casper Jul 22 '12 at 19:46
  • A better question than "which database" is probably "which search product". For example, see [Fulltext search in NoSQL databases](http://stackoverflow.com/questions/5453872/full-text-search-in-nosql-databases). It depends if you want search engine features like [faceted search](http://en.wikipedia.org/wiki/Faceted_search), [stemming](http://en.wikipedia.org/wiki/Stemming), [stopwords](http://en.wikipedia.org/wiki/Stopwords), and [boolean operators](http://en.wikipedia.org/wiki/Web_search_query) .. or prefer to create something simpler yourself. Elasticsearch seems a popular open source option. – Stennie Jul 22 '12 at 21:14
  • The tags, keywords & contact_via are overlapping. They are completely separate attributes, but since we don't set the names, they could have the same name, yes. – Williamf Jul 24 '12 at 03:34
  • I've been leaning heavily toward flipping to a search-based product, and leaving the data where it is. I tried hitting Cassandra or Couch and both of those were a big bag of fail. I'll update as I'm testing ElasticSearch. – Williamf Jul 24 '12 at 03:35

2 Answers


When needing to run multiple queries against arrays, we found the best solution, at least for us, was to go with ElasticSearch. We get this, plus some other bonuses, and we can reduce the index requirements for Mongo, so it's a win/win.
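As a sketch of what this looks like on the ElasticSearch side (field names assumed from the sample document; the index name and client gem are not specified in this answer), the first example query maps onto a bool query, where multiple `must` clauses give AND and a `terms` clause gives OR within one field. Shown here as a plain Ruby hash of the request body:

```ruby
# Sketch only: the first example query as an ElasticSearch bool query.
# Multiple entries in 'must' are ANDed together; a 'terms' clause
# matches documents containing ANY of the listed values (OR).
es_query = {
  'query' => {
    'bool' => {
      'must' => [
        { 'term'  => { 'Tags' => 'website' } },                           # AND
        { 'term'  => { 'Tags' => 'shopping' } },                          # AND
        { 'terms' => { 'Keywords' => ['hair accessories', 'fashion'] } }, # OR within field
        { 'term'  => { 'ContactVia' => 'email' } }
      ]
    }
  },
  'size' => 10,
  'sort' => [{ 'CreatedAt' => 'desc' }]
}
# This body would be POSTed to the index's _search endpoint
# (e.g. via a Ruby client gem of your choice).
```

Because ElasticSearch is built around exactly this kind of multi-clause filtering plus sorting, none of it forces a full scan the way the Mongo multi-range case does.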

Williamf

My two cents are for MongoDB. Not only can your data be represented, saved, and loaded as raw Ruby hashes, but Mongo is modern and fast, and really, really easy to learn. Here's all you need to do to start the Mongo server:

mongod --dbpath /path/to/dir/w/dbs

Then to get the console, which is just a basic JavaScript console, just invoke mongo. And using it from Ruby is just this simple:

require 'mongo'

db = Mongo::Connection.new['somedb']
stuff = db['stuff']
stuff.find.to_a #=> []
stuff.insert({id: 'abcd', name: 'Swedish Chef', says: 'Bork bork bork!'})
stuff.find.to_a #=> [{"_id"=>..., "id"=>"abcd", "name"=>"Swedish Chef", "says"=>"Bork bork bork!"}]
# update takes two arguments: a selector and an update document
stuff.update({id: 'abcd'}, {'$set' => {says: 'Bork bork bork!!!! (Bork)!'}})
stuff.find.to_a #=> [{"_id"=>..., "id"=>"abcd", "name"=>"Swedish Chef", "says"=>"Bork bork bork!!!! (Bork)!"}]
Linuxios
  • We're currently in Mongo. But, we can't query it without big scans, because you can only query a single range and use indexes. It won't create an index that has multiple ranges on it (arrays, sort orders, etc). – Williamf Jul 24 '12 at 03:37