
I have a problem...

I need to store a daily barrage of about 3,000 mid-sized XML documents (100 to 200 data elements).

The data is somewhat unstable in the sense that the schema changes from time to time, and the changes are not announced with enough advance notice; they have to be dealt with retroactively, on an emergency "hotfix" basis.

The consumption pattern for the data involves both a website and some simple analytics (some averages and pie charts).

MongoDB seems like a great solution except for one problem: it requires converting between XML and JSON. I would prefer to store the XML documents as they arrive, untouched, and shift any intelligent processing to the consumer of the data. That way, any bugs in the data-loading code cannot cause permanent damage. Bugs in the consumer(s) are always harmless, since you can fix them and re-run without permanent data loss.
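For illustration, here is a minimal sketch of that "store it untouched" approach with MongoDB via pymongo; the database, collection, and field names are my own placeholders, and it assumes a local mongod on the default port:

    from datetime import datetime, timezone
    from pymongo import MongoClient

    col = MongoClient()["feeds"]["raw_xml"]   # hypothetical database/collection names

    def store_untouched(xml_text: str, source: str) -> None:
        # Keep the document exactly as it arrived; no parsing, no conversion.
        # A loader bug can then never corrupt the original data.
        col.insert_one({
            "received_at": datetime.now(timezone.utc),
            "source": source,
            "raw_xml": xml_text,
        })

Mid-sized documents like these sit far below MongoDB's 16MB per-document limit, so storing the raw text alongside a little metadata is unproblematic. Of course, MongoDB cannot index or query inside that string, which is exactly the trade-off in question.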

I don't really need "massively parallel" processing capabilities. It's about 4GB of data, which fits comfortably in memory on a 64-bit server.

I have eliminated Cassandra from consideration (due to its complex setup) and CouchDB (due to its lack of familiar features such as indexing, which I will need initially because of my RDBMS way of thinking).

So finally here's my actual question...

Is it worthwhile to look for a native XML database, even though such databases are not as mature as MongoDB, or should I bite the bullet, convert all the XML to JSON as it arrives, and just use MongoDB?
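For comparison, a minimal sketch of the convert-on-arrival route, using the third-party xmltodict package (my choice for illustration, not something mandated by MongoDB) to turn each document into a JSON-like dict before inserting it:

    import xmltodict                 # pip install xmltodict
    from pymongo import MongoClient

    col = MongoClient()["feeds"]["orders"]   # hypothetical names

    def load(xml_text: str) -> None:
        # xmltodict maps attributes to "@attr" keys and text nodes to "#text",
        # so the conversion is mechanical but not perfectly round-trippable.
        col.insert_one(xmltodict.parse(xml_text))

    load("<order id='1'><total>9.99</total></order>")   # made-up payload

Note that the conversion is lossy in the general case (mixed content, comments, namespaces), which is one argument for keeping the original XML around regardless.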

Alex R
  • I'm not sure why you need MongoDB if you just want to store files? What indexing do you need that CouchDB can't do, especially if you just treat the docs as files/attachments? – WiredPrairie Sep 13 '13 at 01:49
  • I get them as files, but I don't want to store them as files, because I need to query them in flexible ways without writing a ton of code. – Alex R Sep 13 '13 at 02:58
  • Have you tried converting some of your data and your queries? You'll find that there are lots of ways to do it, not necessarily right ways, and lots of things you'll need to worry about regarding performance, etc. – WiredPrairie Sep 13 '13 at 10:46

2 Answers


You may have a look at BaseX (basex.org), which comes with a built-in XQuery processor and full-text indexing.
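For a taste of what storing the documents untouched looks like there, here is a small sketch against BaseX's REST API; it assumes a basexhttp server on the default port 8984 with the default admin credentials, and the database and document names are made up:

    import requests   # pip install requests

    BASEX = "http://localhost:8984/rest"
    AUTH = ("admin", "admin")        # default credentials; change in production

    requests.put(f"{BASEX}/feed", auth=AUTH)    # create the database (idempotent)
    xml = "<order id='1'><total>9.99</total></order>"   # hypothetical payload
    # Store one incoming document, byte-for-byte as it arrived
    requests.put(f"{BASEX}/feed/2013-09-13/order-1.xml", data=xml, auth=AUTH)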

Lars GJ

That Data Volume is Small

If there is no need for parallel data processing, there is no need for MongoDB. With small amounts of data like 4GB, the overhead of distributing the work can easily exceed the actual evaluation effort.

4GB / 60k nodes is not large for XML databases, either. After some time getting into it, you will come to appreciate XQuery as a great tool for XML document analysis.
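As a sketch of what such an analysis can look like (again assuming a local BaseX HTTP server as above, and a made-up <total> element), a one-line XQuery already covers the kind of averages the question mentions:

    import requests

    BASEX = "http://localhost:8984/rest"
    AUTH = ("admin", "admin")

    # avg() over a hypothetical <total> element across all stored documents
    r = requests.get(f"{BASEX}/feed",
                     params={"query": "avg(//total/number())"},
                     auth=AUTH)
    print(r.text)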

Is it Really?

Or do you receive 4GB every day and have to evaluate it together with all the data you have already stored? Then you will eventually reach a volume that you cannot store and process on one machine any more, and distributing the work will become necessary; not within days or weeks, but a single year will already bring you well over 1TB (4GB × 365 days ≈ 1.4TB).

Converting to JSON

What does your input look like? Does it adhere to any schema, or even resemble tabular data? MongoDB's capabilities for analyzing semi-structured data are far worse than what XML databases provide. On the other hand, if you only want to pull a few fields from well-defined paths and can analyze one input file after the other, MongoDB probably will not suffer much.
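If the fields really do sit on well-defined paths, that pull-a-few-fields approach is only a couple of lines with the standard library; the element names below are stand-ins for the real schema:

    import xml.etree.ElementTree as ET
    from pymongo import MongoClient

    col = MongoClient()["feeds"]["flat"]   # hypothetical names

    def extract(path: str) -> None:
        # Keep only the fields the website and the charts actually need
        root = ET.parse(path).getroot()
        col.insert_one({
            "customer": root.findtext("customer/name"),
            "total": float(root.findtext("total") or "nan"),
        })

Flat documents like these are exactly what MongoDB's query and aggregation machinery is good at.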

Carrying XML into the Cloud

If you want to use both an XML database's capabilities for analyzing the data and a NoSQL system's capabilities for distributing the work, you could run the database from within that system.

BaseX is getting to the cloud with exactly the capabilities you need, but it will probably still take some time for that feature to become production-ready.

Jens Erat
  • In what way is MongoDB "way worse" than an XML database for analytical purposes? – WiredPrairie Sep 13 '13 at 10:40
  • 1
    XML databases are built for querying large and complex tree structured data; Mongo DB is built for aggregating over large sets of small JSON documents. There are huge differences in data storage, index structures and chosen algorithms. It heavily depends on what kind of data you have and how you query it, massive amounts of small files which gain from distributing evaluation will probably be faster in Mongo DB, few large (not [easily] chunkable files) will probably be faster when processed by a native XML DB. – Jens Erat Sep 13 '13 at 14:03
  • Do you have evidence to back your statements and claims in your answer? A BSON document can be 16MB in size ... I'd consider that to be large. – WiredPrairie Sep 13 '13 at 14:20
  • 1
    "Large" in case of XML Databases starts in GB range and goes into TBs. 16MB is _tiny_. It's all a matter of what you want to do in the end; if you're just aggregating some kinds of logs you will probably go better with MongoDB, if you're doing more complicated and repeated analysis (eg. involving more than one of those described files) XML databases will probably be the better deal. Or go with Marklogic which sits somewhere in between, but is commercial. – Jens Erat Sep 13 '13 at 15:38