2

I am using Impala for querying parquet-tables and cannot find a solution to increment an integer-column ranging from 1..n. The column is supposed to be used as ID-reference. Currently I am aware of the uuid() function, which

Returns a universal unique identifier, a 128-bit value encoded as a string with groups of hexadecimal digits separated by dashes.

Anyhow, this is not suitable for me since I have to pass the ID to another system which requests an ID in style of 1..n. I also already know that Impala has no auto-increment-implementation.

The desired result should look like:

-- UUID() provided as example - I want to achieve the `my_id`-column.

| my_id | example_uuid | some_content |
|-------|--------------|--------------|
| 1     | 50d53ca4-b...| "a"          |
| 2     | 6ba8dd54-1...| "b"          |
| 3     | 515362df-f...| "c"          |
| 4     | a52db5e9-e...| "d"          |
|-------|--------------|--------------|

How can I achieve the desired result (integer-ID ranging from 1..n)?


Note: This question differs from this one which specifically handles Kudu-tables. However, answers should be applicable for this question as well.

Markus
  • 2,265
  • 5
  • 28
  • 54

2 Answers2

3

Since other Q&A's like this one only came up with uuid()-alike answers, I put some thought in it and finally came up with this solution:

SELECT
    row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") as my_id
    , some_content
FROM some_table
  • row_number() generates a continuous integer-number over a provided partition. Unlike rank(), row_number() always provides an incremented number on its partition (even if duplicates occur)
  • PARTITION BY "dummy" partitions the entire table into one partition. This works since "dummy" is interpreted in the execution graph as temporary column yielding only the String-value "dummy". Thus, also something analog to "dummy" works.
  • ORDER BY is required in order to generate the increment. Since we don't care about the order in this example (otherwise just set your respective column), also use the "dummy"-workaround.

The command creates the desired incremental ID without any nested SQL-statements or other tricks.

| my_id | some_content |
|-------|--------------|
| 1     | "a"          |
| 2     | "b"          |
| 3     | "c"          |
| 4     | "d"          |
|-------|--------------|
Markus
  • 2,265
  • 5
  • 28
  • 54
0

I used Markus's answer for a large partitioned table and found that I was getting duplicate ids. I think the ids were only unique within their partition; possibly PARTITION BY "dummy" leads Impala to think that each partition can execute row_number() on its own. I was able to get it working by specifying an actual column to order by and no partition by:

SELECT
    row_number() OVER (ORDER BY actual_column) as my_id
    , some_content
FROM some_table

It doesn't seem to matter whether the values in the column are unique (mine weren't), but using the actual partition key might result in the same issue as the "dummy" column.

Understandably, it took a lot longer to run than the dummy version.

Stephen
  • 824
  • 1
  • 8
  • 16
  • 1
    Hey Stephen, your answer seems rather to be a comment because its depending on your specific configuration or table structur. Note that `PARTITION BY` has nothing to do with your actual physical partitioning (read the documentation [here](https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_analytic_functions.html#over)) and acts like a `GROUP BY`. Thus, if you run into issues you have to check if you grouped something in your `some_content` by accident. Answers from SO always have to be adapted to the own problem. – Markus Mar 14 '22 at 10:15