7

I am trying to add a UUID column to my dataset.

getDataset(Transaction.class).withColumn("uniqueId", functions.lit(UUID.randomUUID().toString())).show(false);

But in the result, all the rows have the same UUID. How can I make it unique for each row?

+-----------------------------------+
|uniqueId                           |
+-----------------------------------+
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
+-----------------------------------+
Adiant
  • 859
  • 4
  • 16
  • 34
  • 1
    Check the link below: https://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator?rq=1 – D-2020365 Apr 09 '18 at 15:14
  • No, I tried the solution in the link; it uses lit, which is not the right solution. – Adiant Apr 09 '18 at 15:39

1 Answer

12

Updated (Apr 2021):

Per @ferdyh, there's a better way using the uuid() function from Spark SQL. Something like expr("uuid()") will use Spark's native UUID generator, which should be much faster and cleaner to implement.
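In Java, a minimal sketch (assuming Spark 2.3+, where the SQL uuid() function is available, and reusing the question's getDataset(Transaction.class) helper) would look like:

import org.apache.spark.sql.functions;

getDataset(Transaction.class)
    .withColumn("uniqueId", functions.expr("uuid()")) // evaluated per row, not once on the driver
    .show(false);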

Originally (June 2018):

When you include the UUID as a lit column, you're doing the same as including a string literal: the expression is evaluated once and the same value is copied to every row.

The UUID needs to be generated for each row. You could do this with a UDF; however, this can cause problems because UDFs are expected to be deterministic, and relying on randomness inside them can cause issues when caching or regeneration happen.

Your best bet may be generating a column with the Spark function rand and using UUID.nameUUIDFromBytes to convert that to a UUID.
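A sketch of that idea in Java (the UDF name toUuid and its registration are illustrative, not from the original answer; the point is that the UDF itself is deterministic, since all the randomness comes from Spark's rand() column):

import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

// Deterministic mapping from a random double to a (version-3) UUID.
// Note: if rand() ever produces the same value twice, the UUIDs collide.
UDF1<Double, String> toUuid = seed ->
    UUID.nameUUIDFromBytes(seed.toString().getBytes(StandardCharsets.UTF_8)).toString();
sparkSession.udf().register("toUuid", toUuid, DataTypes.StringType);

getDataset(Transaction.class)
    .withColumn("uniqueId", functions.callUDF("toUuid", functions.rand()))
    .show(false);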

Originally, I had:

import org.apache.spark.sql.functions.udf

val uuid = udf(() => java.util.UUID.randomUUID().toString)
getDataset(classOf[Transaction]).withColumn("uniqueId", uuid()).show(false)

which @irbull pointed out could be an issue.

Benjamin Manns
  • 9,028
  • 4
  • 37
  • 48
  • Thanks a lot Benjamin. This solution is working. In Java, creating a UDF is a bit more tedious; the UDF needs to be created and registered like below: static UDF1 uniqueId = types -> UUID.randomUUID().toString(); sparkSession.udf().register("uId", uniqueId, DataTypes.StringType); – Adiant Apr 10 '18 at 08:55
  • 5
    There are two problems with this solution. 1. UUID.randomUUID() is not guaranteed to be unique across nodes. It uses a pseudo-random number, which is fine on a single machine, but in a cluster environment, you could get collisions. 2. UDFs should be deterministic. That is, for the same input you get the same output (spark reserves the right to cache, reuse results, etc...), or call the same method multiple times if it chooses. https://stackoverflow.com/questions/42960920/spark-dataframe-random-uuid-changes-after-every-transformation-action – irbull Jun 13 '18 at 23:02
  • 2
    Great point @irbull - I'll update to reflect. – Benjamin Manns Jun 14 '18 at 15:33
  • 3
    @irbull then what would be a good way to generate new unique ids when appending rows to a dataframe? `monotonically_increasing_id` + `last stored monotonically_increasing_id`? – Mehdi Jan 27 '19 at 17:34
  • 1
    This generates a unique id for each row: csv.withColumn("uuid", monotonically_increasing_id()) – Priyanshu Singh Jan 23 '20 at 05:27
  • Just to be sure: using `monotonically_increasing_id` with custom logic to add the last max value will not work, as the monotonically increasing id is calculated based on partition and row number, so the same dataframe can have values from 0 to 100 and 854654 to 854659 – aweis Feb 20 '20 at 15:43
  • 1
    Interestingly enough, Spark SQL does have support for this, https://issues.apache.org/jira/browse/SPARK-23599, in a way that is deterministic between retries. – Jelmer Mar 30 '20 at 07:28
  • 1
    This is a really slow solution and there is a _much_ simpler one: Spark has its own uuid function. UDFs are bad for your Spark code -- they are the golden hammer of Spark coding. – Marco Jul 07 '21 at 09:16
  • 1
    The uuid function isn't in the Scala API yet, so you'd have to do something like ```expr("uuid()")``` – ferdyh Aug 24 '21 at 13:02