Kudu auto generated key column

Question

I am trying to make custom auto generated/incremented key in Kudu which will keep increasing its value -from a starting seed which zero by default.

It's pretty inefficient to go through all records and increment a counter to get a row count.

Does Kudu provide the rows count out of the box? If not, what are the best way to get it?

score 3 · Accepted Answer · answered May 21 '17 at 03:54

Apache Kudu does not support AUTO_INCREMENT columns at this time. There is a FAQ entry on the Kudu web site that mentions this.

Kudu is a distributed storage engine that is focused on being a good analytical store (OLAP) as opposed to being a good transactional store (OLTP) and it shows in the features we've prioritized so far. This is a good example of that.

Because we're not trying to be an OLTP store, Kudu doesn't yet implement multi-row or multi-node transactions, and so a simple incrementing primary key counter would be difficult to implement correctly at this time -- especially for example when the table is hash-partitioned on the primary key. We'd need a central transaction coordinator that doesn't currently exist.

To answer your second question, getting a row count is currently a little expensive in Kudu as it involves scanning the index column on each tablet and summing up the total count. Apache Impala / Apache Spark SQL will do this transparently for you if you do a SELECT COUNT(*) from kudu_table but I wouldn't currently rely on that for the purposes of assigning a new ID, since Impala currently allows scanning from a slightly stale Kudu replica thus potentially being off on the row count.

The best thing to do right now is rely on some external mechanism to assign row IDs.

Source: I am a PMC member on Apache Kudu.

Markus · Answer 2 · 2022-08-10T20:46:58.420

In addition do @JoeyVanHalens answer, there is another option which is also explained here on SO. You can use row_numer() to create an ID which resembles a counter but does not force you to do some cumbersome nesting or something else if you only want a counter-like column.

Straight forward, it looks like this:

SELECT
    row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") as incremented_id
FROM some_table

row_number() creates an incremtented number over a partition. Unlike rank(), row_number() ensures you to have an increment even if your partition contains duplicates.
PARTITION BY "dummy" interprets a temporary "dummy"-column during runtime as one partition over the entire table. Thus, the increment happens for all records.
ORDER BY follows the same "dummy"-logic.

Of course you can also replace "dummy" by whatever column is necessary for your table-logic.

The result looks like:

-- ID = incremented_id

| ID    | some_content |
|-------|--------------|
| 1     | "a"          |
| 2     | "b"          |
| 3     | "c"          |
| 4     | "d"          |
|-------|--------------|

score 1 · Answer 3 · answered Jun 19 '17 at 11:37

1

There are several ways to get around this.

Use impala's uuid() function to generate a unique id.
convert the uuid() to BIGINT (via hashing, etc.)
use impala's unix_timestamp to generate a BIGINT value representing the current date and time as a delta from the Unix epoch (this might cause some collision, so better add another column if you're going to use this as a primary key.

answered Jun 19 '17 at 11:37

Joey Van Halen

21
2

Welcome to SO. Please provide some more information as URL to documentation, code samples etc. As it is at the time, it is not providing a solution as needed to fit here. – inetphantom Jun 19 '17 at 12:04
UUID's should take up two BIGINTs, unless you want to hash and risk collisions. – Danny Varod Jul 22 '18 at 08:44

Kudu auto generated key column

3 Answers3

Linked