20

I'm looking for a cross-platform database engine that can handle databases of up to hundreds of millions of records without severe degradation in query performance. It needs a C or C++ API that allows easy, fast construction of records and parsing of returned data.

Highly discouraged are products where data has to be translated to and from strings just to get it into the database. The technical users storing things like IP addresses don't want or need this overhead. This is a very important criterion, so if you're going to refer to products, please be explicit about how they offer such a direct API. Not wishing to be rude, but I can use Google - please assume I've found most mainstream products, and that I'm asking because it's often hard to work out just what direct API they offer, rather than just a C wrapper around SQL.

It does not need to be an RDBMS - a simple ISAM record-oriented approach would be sufficient.

Whilst the primary need is for a single-user database, expansion to some kind of shared file or server operations is likely for future use.

Access to source code, either open source or via licensing, is highly desirable if the database comes from a small company. It must not be GPL or LGPL.

Andy Dent
  • I'm wondering just how many databases even fill this requirement, particularly with the last statement about source code. – BobbyShaftoe Feb 08 '09 at 03:05
  • The requirement comes from a real database but for non-technical reasons we are exploring alternatives. I'm avoiding mentioning names to see what others suggest ;-) – Andy Dent Feb 08 '09 at 03:08
  • BobbyShaftoe: many, many solutions are possible... See the whole thread. – bortzmeyer Feb 08 '09 at 11:21
  • I submitted a re-open request and changed the title because I think the people who jumped on this and voted to close it were just chasing editor points. The title previously asked for the "best" product, but the question is very clear about a long list of requirements to be met, and it has been voted helpful by a number of people over a long period. – Andy Dent Nov 03 '14 at 06:12

11 Answers

9

You might consider C-Tree by FairCom - tell 'em I sent you ;-)

Steven A. Lowe
  • c-tree is actually the engine currently being used. Changes in licensing and other considerations are driving me to look for alternatives. This site may stay using c-tree but presents a great performance example for others to meet. – Andy Dent Feb 11 '09 at 21:16
  • @[Andy Dent]: it's going to be hard to beat C-Tree's performance, they've been on top of that game for over 20 years! Ray Brown (President) and Randal Hoff (Dir. Biz. Dev.) were very willing to work with me on licensing back in 2002 for another company, talk to them before you give up. – Steven A. Lowe Feb 12 '09 at 02:52
8

I'm the author of hamsterdb.

Tokyo Cabinet and BerkeleyDB should work fine. hamsterdb definitely will work. It has a plain C API, is open source, platform independent, very fast, and tested with databases of up to several hundred GB and hundreds of millions of items.

If you are willing to evaluate it and need support, drop me a mail (contact form on hamsterdb.com) - I will help as well as I can!

bye Christoph

cruppstahl
  • Hamsterdb evolved into upscaleddb which looked even more interesting but was effectively abandoned in 2017 according to the blog https://upscaledb.com/blog/0015-release-of-upscaledb-2.2.1.html – Andy Dent Feb 12 '19 at 07:00
4

You didn't mention what platform you are on, but if Windows-only is OK, take a look at the Extensible Storage Engine (previously known as Jet Blue), the embedded ISAM table engine included in Windows 2000 and later. It's used for Active Directory, Exchange, and other internal components, and is optimized for a small number of large tables.

It has a C interface and supports binary data types natively. It supports indexes, transactions and uses a log to ensure atomicity and durability. There is no query language; you have to work with the tables and indexes directly yourself.

ESE doesn't like to open files over a network, and doesn't support sharing a database through file sharing. You're going to be hard pressed to find any database engine that supports sharing through file sharing. The Access Jet database engine (AKA Jet Red, totally separate code base) is the only one I know of, and it's notorious for corrupting files over the network, especially if they're large (>100 MB).

Whatever engine you use, you'll most likely have to implement the shared usage functions yourself in your own network server process or use a discrete database engine.

Chris Smith
  • Voted up for being an informative answer and for reminding me that I forgot to say it had to be cross-platform – Andy Dent Feb 08 '09 at 11:33
3

For anyone finding this page a few years later, I'm now using LevelDB with some scaffolding on top to add the multiple indexing necessary. In particular, it's a nice fit for embedded databases on iOS. I ended up writing a book about it! (Getting Started with LevelDB, from Packt in late 2013).

Andy Dent
  • LevelDB is really a poor choice in terms of performance and reliability. The LevelDB design is inherently corruption-prone and the code quality itself is sub-par. See this crash-vulnerability report http://wisdom.cs.wisc.edu/workshops/spring-14/talks/Thanu.pdf Only LMDB is proven crash-proof, and it is fully cross-platform. – hyc Aug 04 '14 at 17:08
  • hyc is not a disinterested party in recommending LMDB over LevelDB, for anyone who reads his comment and doesn't drill down further into his affiliations. I'm not qualified to comment on the veracity of his opinion, as I've not compared the two engines in detail, but he has been following LevelDB postings around and pointing people to LMDB. – Andy Dent Nov 03 '14 at 06:16
  • Yes, I'm the author of LMDB. But independent research confirms what I've said. You may not have compared the two engines in detail but many others have. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai – hyc Nov 04 '14 at 12:19
2

One option could be Firebird. It offers both a server based product, as well as an embedded product.

It is also open source and there are a large number of providers for all types of languages.

Mitch Wheat
Jayden
  • thanks (Hi Mitch), I even have the Firebird book but APIs like http://www.ibpp.org/ are biased towards the SQL side and the pure C API comes with massive disclaimers: http://www.firebirdfaq.org/faq9/ – Andy Dent Feb 08 '09 at 04:07
  • @Andy Dent: it was jaydenm who answered; I just edited to add the firebird link ;-) – Mitch Wheat Feb 08 '09 at 04:45
2

I believe what you are looking for is BerkeleyDB: http://www.oracle.com/technology/products/berkeley-db/db/index.html

Never mind that it's Oracle, the license is free, and it's open-source -- the only catch is that if you redistribute your software that uses BerkeleyDB, you must make your source available as well -- or buy a license.

It does not provide SQL support, but rather direct lookups (via b-tree or hash-table structure, whichever makes more sense for your needs). It's extremely reliable, fast, ACID, has built-in replication support, and so on.

Here is a small quote from the page I refer to above, that lists a few features:

Data Storage

Berkeley DB stores data quickly and easily without the overhead found in other databases. Berkeley DB is a C library that runs in the same process as your application, avoiding the interprocess communication delays of using a remote database server. Shared caches keep the most active data in memory, avoiding costly disk access.

  • Local, in-process data storage
  • Schema-neutral, application native data format
  • Indexed and sequential retrieval (Btree, Queue, Recno, Hash)
  • Multiple processes per application and multiple threads per process
  • Fine grained and configurable locking for highly concurrent systems
  • Multi-version concurrency control (MVCC)
  • Support for secondary indexes
  • In-memory, on disk or both
  • Online Btree compaction
  • Online Btree disk space reclamation
  • Online abandoned lock removal
  • On disk data encryption (AES)
  • Records up to 4GB and tables up to 256TB

Update: Just ran across this project and thought of the question you posted: http://tokyocabinet.sourceforge.net/index.html . It is under LGPL, so not compatible with your restrictions, but an interesting project to check out, nonetheless.

SquareCog
  • I'm very nervous whenever a vendor like Oracle asks you to email them for pricing information without giving even an indication of price brackets. – Andy Dent Feb 08 '09 at 06:55
  • That's a fair concern. Do note that this stuff is free if you do not distribute (or distribute your source). Also you can try to get the pre-acquisition releases; they are just as stable, and don't have Oracle attached to them. – SquareCog Feb 08 '09 at 07:16
1

SQLite would meet those criteria, except for the eventual shared-file scenario in the future (and actually it could probably do that too, if the network file system implements file locks correctly).

DougN
  • SQLite stores all the data as strings. I'm not sure, but I thought the OP's specific reference to storing database data as strings was some sort of cryptic hint that he had already looked at SQLite. Personally, I like SQLite as well. – jussij Feb 08 '09 at 04:21
  • I wasn't trying to hint anything - the existing API is able to pour direct binary values into records and I've already been informed that the largest user of the system is not interested in changing that mode of operations. With these data volumes, would you really want to map ints to strings? – Andy Dent Feb 08 '09 at 06:57
  • There is a SQLite-based true client-server commercial product out there with a C++ SDK - http://www.realsoftware.com/realsqlserver/ – Andy Dent Feb 08 '09 at 06:58
  • SQLite does not store ints as strings (I think it may have once, but that was at least a year or two ago, if not more). In fact, it uses variable-sized binary ints so it can store bytes and int64s in the same column, and will use the least space possible. – DougN Feb 11 '09 at 00:35
1

Many good solutions (such as SQLite) have been mentioned. Let me add two, since you don't require SQL:

  • HamsterDB: fast, simple to use, can store arbitrary binary data. No provision for shared databases.
  • Glib's HashTable module seems quite interesting too, and is very common, so you won't risk going into a dead end. On the other hand, I'm not sure there is an easy way to store the database on disk; it's mostly for in-memory stuff.

I've tested both on multi-million-record projects.

bortzmeyer
  • thanks, both look interesting. Whilst Glib is only apparently available under LGPL, that doesn't rule it out (although I'm cautious about LGPL as it limits your build options lest it devolve into GPL). – Andy Dent Feb 08 '09 at 11:46
1

As you are familiar with FairCom's c-tree, you are probably also familiar with Raima RDM.

It went open source a few years ago, then dbstar claimed that they had somehow acquired the copyright. This seems debatable, though; from reading the original Raima license, it does not seem possible. Of course, it is possible to stay with the original code release. It is rather rare, but I have a copy archived away.

0

There used to be a product called Btrieve, but I'm not sure if source code was included, and I think it has been discontinued. The only database engine I know of with an ISAM orientation is c-tree.

0

SQLite tends to be the first option. It doesn't store data as strings, but I think you have to build a SQL command to do the insertion, and that command will involve some string building.

BerkeleyDB is a well-engineered product if you don't need a relational DB. I have no idea what Oracle charges for it, or whether you would need a license for your application.

Personally, I would consider why you have some of your requirements. Have you done testing to verify the requirement that you need direct insertion into the database? It seems like you could take a couple of hours to write a wrapper that converts from whatever API you want to SQL, and then see if SQLite, MySQL, ... meet your speed requirements.

hacken
  • Ambitious underlings have tried to replace the original twice with SQL engines - the size is a real requirement, and the existing mappings into C++ code work very well with a binary API – Andy Dent Feb 08 '09 at 06:51
  • Oh snap. I am going to have to start using "ambitious underling" as a sig line. It has the right ring to it. – SquareCog Feb 08 '09 at 07:19
  • Not my line nor were they my underlings but feel free to appropriate the term ;-) – Andy Dent Feb 09 '09 at 03:43
  • Yes it seems pretty firm in http://www.sqlite.org/cintro.html that you're using an indirect SQL-based API with no workaround. – Andy Dent Feb 11 '09 at 20:58