
I'm fairly new to Dask and wondering why it behaves this way. I create a new column of random UUIDs and then join the dataframe to another Dask dataframe. The UUIDs differ between the two outputs, and I'm not sure what I'm missing.

This is a representation of my code:

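from uuid import uuid4

import dask.dataframe as dd

def generate_uuid(_row) -> str:
    """Generate a random UUID4 string; apply(axis=1) passes the row, which is ignored."""
    return str(uuid4())

my_dask_data = dd.from_pandas(my_pandas_data, npartitions=4)
my_dask_data["uuid"] = None
# meta is a (name, dtype) tuple describing the resulting column
my_dask_data["uuid"] = my_dask_data.apply(generate_uuid, axis=1, meta=("uuid", "str"))
print(my_dask_data.compute())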

And this is the output:

name       uuid
my_name_1  16fb858c-bbed-413b-a415-62099ee2c455
my_name_2  9acd0a22-9b19-4db6-9759-b70dc0353710
my_name_3  5d610aaf-a813-4d0b-8d83-8f11fe400c7e

Then, I do a concat with other dask dataframe:

joined_data = dd.concat([my_dask_data, my_other_dask_data], axis=1)
print(joined_data.compute())

This is the output, which for some reason contains new UUIDs:

name       uuid                                  tests
my_name_1  f951cefa-1145-411c-96f6-924730d7cb22  test1
my_name_2  88e28e5f-42ea-4fbe-a036-b8179a0ba3f8  test2
my_name_3  50e70fac-da19-4d2f-b6ea-80da41591ac5  test3

Any thoughts on how to keep the UUIDs from changing?

1 Answer

By design, Dask does not keep your data in memory, which is one of its most attractive features. Every time you call compute(), the task graph runs again, so your function is executed again. Since uuid4() draws from a random number generator, getting different results each time is expected. In fact, random UUIDs are meant never to repeat.

The question is: what would you like to happen, and what is your actual workflow? You might be interested in reading this SO question: "How to generate a random UUID which is reproducible (with a seed) in Python".
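One way to get reproducible IDs, assuming each row carries a stable key (the name values from the question are used here), is a name-based version-5 UUID from the standard library. This is a different technique from seeding a random generator and is not necessarily what the linked question recommends; `NAMESPACE` and `deterministic_uuid` are hypothetical names chosen for illustration.

```python
import uuid

# Hypothetical project namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def deterministic_uuid(key: str) -> str:
    """Return a version-5 UUID derived from the namespace and the key."""
    return str(uuid.uuid5(NAMESPACE, key))


# The same key always maps to the same UUID, no matter how often this runs.
ids = {key: deterministic_uuid(key) for key in ["my_name_1", "my_name_2", "my_name_3"]}
```

Because the function depends only on its input, it should be safe to re-run on every compute, for example with something like `my_dask_data["name"].map(deterministic_uuid, meta=("uuid", "str"))`. Note that these are version-5 UUIDs rather than random version-4 ones, so two rows with the same key get the same ID.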

mdurant
  • Hi Mdurant, I am trying to create a "foreign-key-relationship-idea" between items, each item having its own uuid. The use of UUIDs is compulsory. You could see it as having different types of cars where Ford, Seat or Subaru have their own "brand_uuid". At the same time, I might create a new dataframe from the dataset selecting only Subaru and create a new uuid, color_uuid, based on the colour of the car. In the end I'd like to pd.concat the "subaru df" to the original dataframe. The problem I have is that I cannot reuse the color_uuid because the uuids have changed. – user18140022 Sep 09 '22 at 19:06
  • So you need a deterministic UUID or hash based on the brand. The link I gave has some suggestions along those lines. – mdurant Sep 09 '22 at 19:22
  • Oh that's brilliant! I'll have a look at it now :) – user18140022 Sep 12 '22 at 07:55