4

I am trying to make custom auto generated/incremented key in Kudu which will keep increasing its value -from a starting seed which zero by default.

It's pretty inefficient to go through all records and increment a counter to get a row count.

Does Kudu provide the rows count out of the box? If not, what are the best way to get it?

C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
Anas Mosaad
  • 131
  • 2
  • 8

3 Answers3

3

Apache Kudu does not support AUTO_INCREMENT columns at this time. There is a FAQ entry on the Kudu web site that mentions this.

Kudu is a distributed storage engine that is focused on being a good analytical store (OLAP) as opposed to being a good transactional store (OLTP) and it shows in the features we've prioritized so far. This is a good example of that.

Because we're not trying to be an OLTP store, Kudu doesn't yet implement multi-row or multi-node transactions, and so a simple incrementing primary key counter would be difficult to implement correctly at this time -- especially for example when the table is hash-partitioned on the primary key. We'd need a central transaction coordinator that doesn't currently exist.

To answer your second question, getting a row count is currently a little expensive in Kudu as it involves scanning the index column on each tablet and summing up the total count. Apache Impala / Apache Spark SQL will do this transparently for you if you do a SELECT COUNT(*) from kudu_table but I wouldn't currently rely on that for the purposes of assigning a new ID, since Impala currently allows scanning from a slightly stale Kudu replica thus potentially being off on the row count.

The best thing to do right now is rely on some external mechanism to assign row IDs.

Source: I am a PMC member on Apache Kudu.

Michael Percy
  • 293
  • 1
  • 9
2

In addition do @JoeyVanHalens answer, there is another option which is also explained here on SO. You can use row_numer() to create an ID which resembles a counter but does not force you to do some cumbersome nesting or something else if you only want a counter-like column.

Straight forward, it looks like this:

SELECT
    row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") as incremented_id
FROM some_table
  • row_number() creates an incremtented number over a partition. Unlike rank(), row_number() ensures you to have an increment even if your partition contains duplicates.
  • PARTITION BY "dummy" interprets a temporary "dummy"-column during runtime as one partition over the entire table. Thus, the increment happens for all records.
  • ORDER BY follows the same "dummy"-logic.

Of course you can also replace "dummy" by whatever column is necessary for your table-logic.

The result looks like:

-- ID = incremented_id

| ID    | some_content |
|-------|--------------|
| 1     | "a"          |
| 2     | "b"          |
| 3     | "c"          |
| 4     | "d"          |
|-------|--------------|
Markus
  • 2,265
  • 5
  • 28
  • 54
1

There are several ways to get around this.

  1. Use impala's uuid() function to generate a unique id.

  2. convert the uuid() to BIGINT (via hashing, etc.)

  3. use impala's unix_timestamp to generate a BIGINT value representing the current date and time as a delta from the Unix epoch (this might cause some collision, so better add another column if you're going to use this as a primary key.

  • Welcome to SO. Please provide some more information as URL to documentation, code samples etc. As it is at the time, it is not providing a solution as needed to fit here. – inetphantom Jun 19 '17 at 12:04
  • UUID's should take up two BIGINTs, unless you want to hash and risk collisions. – Danny Varod Jul 22 '18 at 08:44