
As far as I know, DuckDB is a columnar database and can process and store sparse data efficiently.

So, would it be possible to use it as a "tuple space" or "document database"? I don't expect to get top performance from DuckDB in such a use case; good-enough performance would suffice.

There is going to be one huge table with type, title, author, duration, director, note, todo columns, with an index on each, storing the following objects (10 million objects):

{ type: "book", title: "Art of War", author: "Sun Tzu" }
{ type: "book", title: "Fooled by Randomness", author: "Nassim Taleb" },

{ type: "movie", title: "Total Recall", duration: 1.5, director: "Verh" },

{ type: "note", note: "Earth moves around Sun" }

{ type: "todo", todo: "Buy milk" }

Typical queries will be: select a list of objects:

select * from objects where type = 'book' and author in ('Taleb', 'Sun Tzu')

get group stats for a given filter:

select type, count(*)
from objects
where author in ('Taleb', 'Sun Tzu')
group by type
Alex Craft
  • "olumnar database and can process and store sparse data efficiently." - where are you seeing references to "sparse" data w.r.t. DuckDB? – Dai Jul 02 '23 at 04:04
  • @Dai a) process efficiently - I assume that, since it's columnar and doesn't need to load the whole row, it should be able to process sparse data efficiently; b) store efficiently - I found this GitHub issue about storage of sparse data: https://github.com/duckdb/duckdb/issues/632 – Alex Craft Jul 02 '23 at 04:06
  • I don't think column-store DBs have data-storage-efficiency advantages over traditional rowstores (but they do have significant performance advantages for _certain_ workloads) - though I imagine columnstores _probably_ compress better too, but table-compression is a double-edged sword. As for that GitHub issue you linked to: they describe their compression system working well for sparse tables (i.e. tables where most "cells" are `NULL`), but that's to be expected: all major RDBMS today offer efficient storage of sparse tables and columnstores too... – Dai Jul 02 '23 at 04:10

1 Answer


As far as I know, DuckDB is a columnar database and can process and store sparse data efficiently.

It is indeed a column-store database, but I haven't seen any technical information (especially not any benchmarks) demonstrating that DuckDB is any more space-efficient at storing sparse tables (i.e. mostly NULL) than other RDBMS, columnar or rowstore. Meanwhile, many incumbent RDBMS will prove difficult for DuckDB to unseat: for example, MS SQL Server supports both sparse tables and columnar storage, while Postgres users will gladly remind you that NULLs are cheap there too.
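
For illustration, here's a minimal T-SQL sketch of SQL Server's SPARSE column modifier, applied to columns like your note and duration (sparse columns must be declared nullable; the id column is my addition):

create table objects (
    id int identity primary key,
    note nvarchar(400) sparse null,  -- stored compactly when mostly NULL
    duration float sparse null
);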

DuckDB's main selling point is that it's an in-proc columnar database, so it's ideal for (say) SQLite or JET (Red or Blue) users who need or want to stay on an in-proc solution but need columnar storage for their application.

So, would it be possible to use it as a "tuple space" or "document database"? I don't expect to get top performance from DuckDB in such a use case; good-enough performance would suffice.

There is going to be one huge table with type, title, author, duration, director, note, todo columns, with an index on each, storing the following objects (10 million objects)

What you're describing there is not what I'd call a "document" store, because you aren't storing a single document blob addressed by a single key, and you aren't parsing-and-rewriting blob data to edit it; on the contrary, your data is largely normalized already. I'd say your design is best suited to a wide-column-store-type database (like Apache Cassandra or Azure Storage Tables). However, wide-column-store DBs also tend to be schema-free: that is, columns are created on-the-fly to correspond to the member-properties of the entities you store in the table; there's no requirement to use pre-defined columns, and there's no way to constrain data to use those columns either.

...whereas DuckDB is like all the other RDBMS: you can only INSERT INTO columns you defined in prior CREATE TABLE/ALTER TABLE statements, which necessarily restricts its appropriateness for wide-column-store scenarios: if you have a new entity property you want to store, you'll have to run DDL statements like ALTER TABLE, which often (but not always) require a full table rebuild. In fact, DuckDB's documentation (as of July 2023) warns that ALTER TABLE is not a command you'll want to be using very often...

https://duckdb.org/docs/sql/statements/alter_table.html

ALTER TABLE

[...]

  • At the moment DuckDB will not allow you to alter a table if there are any dependencies. That means that if you have an index on a column you will first need to drop the index, alter the table, and then recreate the index. Otherwise you will get a “Dependency Error.”

ADD COLUMN

[...]

  • The new column will be filled with the specified default value, or NULL if none is specified.

I'm not a DuckDB expert (or even a user), but the fact they say new columns will be "filled" implies that adding new columns is somewhat expensive, whereas other RDBMS (especially traditional row-store DBs) can add new columns instantly without needing to "fill" them on-disk.
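
To illustrate the dependency issue, here's a hedged sketch of what adding a column looks like once an index exists (the index name and the new isbn column are hypothetical):

-- per the docs quoted above: drop the index, alter the table, recreate the index
drop index idx_author;
alter table objects add column isbn varchar;  -- hypothetical new property
create index idx_author on objects (author);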

...which makes me think that DuckDB is less suitable for wide-column-store-style databases. In your example, you use the type property/column to indicate which other properties/columns are expected to exist (e.g. type: 'note' implies a textual note column exists, while type: 'book' implies author and title columns), so if you expect to often add more possible types, with their associated columns, then DuckDB is probably not the best choice. (Though I do like how DuckDB supports CHECK constraints, which you can use to enforce a schema even in a poor-man's-polymorphic table like yours - see the sketch below - though CHECK constraints are a feature of all major RDBMS.)
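
For instance, a minimal sketch of such a table (the per-type requirements in these constraints are my assumptions, inferred from the example objects in the question):

create table objects (
    type varchar not null,
    title varchar,
    author varchar,
    duration double,
    director varchar,
    note varchar,
    todo varchar,
    check (type <> 'book'  or (title is not null and author is not null)),
    check (type <> 'movie' or (title is not null and director is not null)),
    check (type <> 'note'  or note is not null),
    check (type <> 'todo'  or todo is not null)
);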


As a wry aside, I strongly suggest everyone just read Codd's paper and grok some relational algebra; once you internalize it you'll never need nor want to use a denormalized store (key/value, document, wide-column stores, et cetera). (Anecdotally, the only people I've ever seen willingly move from an RDBMS to a document-store were people who didn't want to learn SQL, or had no clue about indexes.) While I have yet to be wooed-over by any document-store system, other DB types, such as graph databases like Neo4j, or vector databases, are fantastic; and key/value blob stores are also incredibly useful when used appropriately (e.g. S3). My advice is just to steer clear of schema-free "JSON"-document-style stores.

Dai