How to expose a large collection of XML documents (~2M) for offline querying (XPath)?

Question

I have just short of 2 million XML documents sitting on 16 GB of file system space. They are all valid and share a single DTD. They are all of roughly equal size (all generated by the same lab information system).

I'm looking for an easy way for a single user to query the whole 2M doc corpus. I'm not looking to expose this to the web or even multiple LAN users; however, I would like to be able to expose some query interface to my intranet. I'm flexible on the query language, but I would like to be able to do ad hoc queries. I want it to be at least semi-performant, and I'm willing to dedicate additional disk space as needed to accommodate indexes.

A workable solution has to be deployable on a single quad-core Linux box with 8 GB of RAM; new hardware isn't an option.

I found eXist-db, but it doesn't seem to have much in the way of activity, and the demo site is down.

I can't find any documentation on its behavior with a corpus of this size. Do you know / know of any documentation? — Finn, Feb 03 '12 at 00:35
Let me quote Wolfgang Meier, the original developer of eXist-db, who wisely advises when questions of "how will my data perform" arise: "Performance pretty much depends on the structure of your data and the queries you try to run. The raw data size doesn't mean much. You rather need to look at the number of elements you want to query and how well eXist can optimize the queries..." This is not in documentation but is discussed often on exist-open; see http://markmail.org/message/pzxzuyd5ugm6oxff. You can do much to optimize your queries too: http://exist-db.org/tuning.html. — Joe Wicentowski, Feb 03 '12 at 23:30
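
Following the point above that the number of elements matters more than raw data size, one way to get a feel for a corpus is to count elements per tag over a sample of documents. This is only a minimal sketch using Python's standard library; the `count_elements` helper and the sample documents (element names `report`, `patient`, `result`) are made up for illustration and are not taken from the question.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_elements(sources):
    """Count element occurrences by tag across XML sources (paths or file objects)."""
    counts = Counter()
    for source in sources:
        # iterparse streams each document instead of building the whole tree first
        for _event, elem in ET.iterparse(source, events=("end",)):
            counts[elem.tag] += 1
            elem.clear()  # children were already counted when their own "end" fired
    return counts

# Hypothetical sample documents standing in for the real lab reports
sample = [
    io.BytesIO(b"<report><patient id='1'/><result code='GLU'>5.4</result></report>"),
    io.BytesIO(b"<report><patient id='2'/><result code='GLU'>6.1</result>"
               b"<result code='NA'>140</result></report>"),
]
counts = count_elements(sample)
print(counts.most_common())
```

Multiplying the per-document counts of the elements you expect to query by the corpus size gives a rough idea of how many nodes an index would have to cover.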

Francis Avila · Accepted Answer · 2012-02-03T18:23:03.243

3

I would try in this order:

BaseX (Has nice GUI. Most promising open source XML db I've found. BSD license)
Sedna (My favorite before BaseX. Apache 2.0 license)
Berkeley DB-XML (Is an embedded flat-file DB. Sleepycat license)
eXist (eXist has always been a hacky disaster. GNU LGPL license)

My hunch is that Berkeley would be the fastest, but BaseX and Sedna are both network-accessible and BaseX would be the easiest to start using and querying. Sedna also has a schema-aware storage system which might be beneficial for the situation you describe. Berkeley's sleepycat license may be an encumbrance for you if you have a commercial use--look at it carefully.
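
Whichever of these you try first, a brute-force scan over a small sample gives you a baseline to compare query times and results against. The sketch below is an assumption-laden illustration, not how any of the databases above work: it uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath, and the `scan` helper, file names, and element names are hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def scan(root_dir, xpath):
    """Yield (file name, element) for every match of `xpath` in *.xml files under root_dir."""
    for path in sorted(Path(root_dir).rglob("*.xml")):
        root = ET.parse(str(path)).getroot()
        # findall accepts ElementTree's limited XPath dialect
        for elem in root.findall(xpath):
            yield path.name, elem

# Two hypothetical documents written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.xml").write_text("<report><result code='GLU'>5.4</result></report>")
    Path(tmp, "b.xml").write_text("<report><result code='NA'>140</result></report>")
    hits = [(name, elem.text) for name, elem in scan(tmp, ".//result[@code='GLU']")]

print(hits)
```

Every query here re-parses every file, which is exactly the cost an indexed XML database is meant to avoid on a 2M-document corpus.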

edited Feb 03 '12 at 18:23

answered Feb 03 '12 at 02:49

Francis Avila


In theory I'm most comfortable with the pure Unix style of [BDB-XML](http://www.theregister.co.uk/2007/07/18/berkeley_db_xml/); however, I also remember the [sleepycat licence](http://en.wikipedia.org/wiki/Sleepycat_License) [concerns](http://stackoverflow.com/questions/1493341/is-berkeley-db-xml-a-viable-database-backend). Can you elaborate on your concerns with eXist? – Finn Feb 03 '12 at 17:24
You have already run into eXist problems: complex to set up, patchy documentation (complicated by old documentation which applies only to the previous version, still in wide use), and not much activity on the project. The other options here have commercial backing, readable documentation, and are simpler to get started. All of them have command lines and interface libraries in popular languages. I'm not sure how BDB-XML is more "pure Unix style" than any other. Updated answer with licenses. – Francis Avila Feb 03 '12 at 18:20

Aravind Yarram · Answer 2 · 2012-02-03T02:57:26.077

1

My preference is to build an inverted index using a full-text search engine. My preferences are listed below; I suggest you spend time researching these three.

Solr (Web interface for querying, easy to get started)
ElasticSearch (Distributed, easy to get started)
Raw Lucene (Solr and ElasticSearch both use Lucene behind the scenes)

Why full-text-search engines?

Faster
Highlighting
Faceting
Allows free-form search (with XML databases you will be working against XPath or XQuery or something similar)
Proven to search fast even with a huge set of files
file-based
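
To make the inverted-index idea concrete, here is a toy sketch in plain Python. It is not how Solr, ElasticSearch, or Lucene are implemented; it only shows the basic data structure (token to set of document ids). The `build_index` and `search` helpers and the sample documents are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(docs):
    """Map each lowercase token to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # itertext() walks all text nodes and discards the element structure
        for chunk in root.itertext():
            for token in re.findall(r"\w+", chunk.lower()):
                index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every token in the query (AND semantics)."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)

# Hypothetical documents; the real corpus would be read from disk
docs = {
    "a.xml": "<report><test>glucose</test><note>fasting sample</note></report>",
    "b.xml": "<report><test>sodium</test><note>random sample</note></report>",
}
index = build_index(docs)
result = search(index, "fasting glucose")
print(result)
```

Note that this version throws away the element structure, so it answers "which documents mention X" but not path-based questions; a real setup would index selected elements as separate fields.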

edited Feb 03 '12 at 02:57

answered Feb 03 '12 at 01:02

Aravind Yarram


Thanks for the suggestion. Should I assume those preferences are sorted? Also why do you prefer the inverted index approach? – Finn Feb 03 '12 at 01:51

score 1 · Answer 3 · answered Feb 03 '12 at 11:58


You definitely want an XML database. I would say the emerging leaders are MarkLogic for a commercial product and eXist for open source. Others might have other views. Getting to grips with a new database product is always a steep learning curve (and the more capable the database, the more there is to learn). But eXist can certainly hack it; don't give up at the first hurdle.

answered Feb 03 '12 at 11:58

Michael Kay



Yeah I've installed eXist and I've got to say it reminded me of why I hate the whole Java ecosystem. The install went perfectly, it looks like a great product, it's got a million moving parts connected in totally opaque ways... – Finn Feb 03 '12 at 17:31
Good to hear you're up and running. One tip: If you're hitting any hurdles with eXist-db, I'd suggest posting your question on exist-open - see https://lists.sourceforge.net/lists/listinfo/exist-open. Some exist folks are here on stackoverflow, but the quickest route to an answer is definitely there (you could even post this exact question there). eXist development is actually **quite** active - as you'll see by looking at subversion logs. eXist is a great project with an active user base, a talented group of core developers, and a friendly, helpful community. – Joe Wicentowski Feb 03 '12 at 18:44

score 1 · Answer 4 · edited Feb 05 '12 at 10:23

I agree with Michael Kay. Use eXist-db if you want open source and MarkLogic if you want commercial. I did a project for the US Library of Congress NDIIPP program, and after an extensive ATAM analysis we selected eXist as superior to the other systems due to its active user community and widespread use. If you have doubts, just do a search on MarkMail. I think you will find that eXist has a more active discussion than any other system.

About 350 pages of the report are online here:

http://www.mnhs.org/preserve/records/legislativerecords/pilot.htm
