
Consider dataframes df1, df2, and df3.

df1 and df2 have an id column, and df3 has a from_id and to_id column.

I need to iterate over all rows of df3, replace from_id and to_id with newly generated unique UUIDs, and then update the corresponding rows in df1 and df2 where (id == from_id) | (id == to_id), matching against the original ids before the change to UUID.

I originally wanted to iterate over all rows of df3 and simply check whether df1 and df2 contain the original from_id or to_id in their id column before replacing both, but I found that iterating over pandas rows is a bad idea and slow.

I'm not sure how to apply the other methods mentioned in that post to this problem, since I'm not applying a simple function or calculating anything, and I think the way I had intended to do it would be too slow for big dataframes.

My current method, which I believe to be slow and inefficient:

import pandas as pd
import uuid
from typing import Tuple

def rand_uuid():
    return uuid.uuid1()

def update_ids(df_places: pd.DataFrame, df_transitions: pd.DataFrame, df_arcs: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    for i in range(len(df_arcs)):
        new_uuid_from = rand_uuid()
        new_uuid_to = rand_uuid()
        new_uuid_arc = rand_uuid()

        # propagate the new ids to every row whose id matches the arc's endpoints
        df_transitions.loc[df_transitions.id == df_arcs.iloc[i]['sourceId'], 'id'] = new_uuid_from
        df_transitions.loc[df_transitions.id == df_arcs.iloc[i]['destinationId'], 'id'] = new_uuid_to

        df_places.loc[df_places.id == df_arcs.iloc[i]['sourceId'], 'id'] = new_uuid_from
        df_places.loc[df_places.id == df_arcs.iloc[i]['destinationId'], 'id'] = new_uuid_to

        # assign via .loc with the row label; chained iloc assignment only modifies a copy
        df_arcs.loc[df_arcs.index[i], 'sourceId'] = new_uuid_from
        df_arcs.loc[df_arcs.index[i], 'destinationId'] = new_uuid_to
        df_arcs.loc[df_arcs.index[i], 'id'] = new_uuid_arc

    return df_places, df_transitions, df_arcs

Here df_places and df_transitions are the above-mentioned df1 and df2, df_arcs is df3, and sourceId/destinationId correspond to from_id/to_id.

Example df_places

+---+----+
|   | id |
+---+----+
| 1 | a1 |
+---+----+
| 2 | c1 |
+---+----+

Example df_transitions:

+---+----+
|   | id |
+---+----+
| 1 | b1 |
+---+----+

Example df_arcs:

+---+----------+---------------+
|   | sourceId | destinationId |
+---+----------+---------------+
| 1 | a1       | b1            |
+---+----------+---------------+
| 2 | b1       | c1            |
+---+----------+---------------+
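
For reference, a minimal construction of these example frames (the df_arcs `id` values below are placeholders, since the example table does not show that column):

import pandas as pd

df_places = pd.DataFrame({'id': ['a1', 'c1']})
df_transitions = pd.DataFrame({'id': ['b1']})
df_arcs = pd.DataFrame({
    'id': ['arc1', 'arc2'],          # placeholder arc ids, not shown in the table above
    'sourceId': ['a1', 'b1'],
    'destinationId': ['b1', 'c1'],
})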
Jack Avante
  • You can create a mapping dict of old: new and then use the map method. – Eric Truett Apr 24 '21 at 13:37
  • Please show how `rand_uuid()` derive (i.e., all `import` and/or `def` lines). If you can generate a numpy array or pandas Series of UUIDs instead of one by one you can vectorize your operation and assign to other dfs using `merge`. – Parfait Apr 24 '21 at 14:16
  • I added the method and imports, as well as example data – Jack Avante Apr 24 '21 at 14:28
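
The mapping-dict approach from the first comment could look roughly like this; a sketch only, assuming df1, df2 and df3 as defined in the question, with columns `id`, `sourceId` and `destinationId`:

import uuid
import pandas as pd

# one shared mapping of old id -> new UUID, built once instead of per row
id_map = {old: uuid.uuid4() for old in pd.concat([df1['id'], df2['id']]).unique()}

# map() is vectorized; fillna() keeps any value that has no entry in the mapping
for df, cols in ((df1, ['id']), (df2, ['id']), (df3, ['sourceId', 'destinationId'])):
    for col in cols:
        df[col] = df[col].map(id_map).fillna(df[col])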

1 Answer


A very simple approach:

import itertools
import uuid

def rand_uuid():
    return uuid.uuid4()

# one mapping of old id -> new UUID, covering every id in df1 and df2
rep_dict = {i: rand_uuid() for i in itertools.chain(df1.id, df2.id)}

# rewrite the endpoint ids in df3, then give each arc its own fresh UUID
df3.replace(rep_dict, inplace=True)
df3.id = df3.id.map(lambda x: rand_uuid())

# apply the same mapping to df1 and df2 so all three frames stay consistent
df1.replace(rep_dict, inplace=True)
df2.replace(rep_dict, inplace=True)
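
As a quick sanity check on the example data from the question (a sketch only; the concrete UUID values will differ on every run):

import itertools
import uuid
import pandas as pd

df1 = pd.DataFrame({'id': ['a1', 'c1']})                 # df_places
df2 = pd.DataFrame({'id': ['b1']})                       # df_transitions
df3 = pd.DataFrame({'id': ['arc1', 'arc2'],              # placeholder arc ids
                    'sourceId': ['a1', 'b1'],
                    'destinationId': ['b1', 'c1']})

rep_dict = {i: uuid.uuid4() for i in itertools.chain(df1.id, df2.id)}

df3.replace(rep_dict, inplace=True)
df3.id = df3.id.map(lambda x: uuid.uuid4())
df1.replace(rep_dict, inplace=True)
df2.replace(rep_dict, inplace=True)

# every sourceId/destinationId in df3 should now be one of the new ids in df1 or df2
new_ids = set(df1.id) | set(df2.id)
assert set(df3.sourceId) <= new_ids and set(df3.destinationId) <= new_ids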
PieCot
  • I checked the generated dictionary `rep_dict`, and it seems to have generated the same UUIDs for all the entries. Is this because it happened too fast and UUID uses seconds, or something? – Jack Avante Apr 24 '21 at 18:52
  • 1
    Inspect the generated IDs, they should be very similar, but different in the end. BTW, I'd use `uuid4`(https://docs.python.org/3/library/uuid.html#uuid.uuid4) rather than `uuid1` to generate random unique identifiers not dependent on the host ID, the sequence number, and the time. I've update the answer. Give it a try! – PieCot Apr 24 '21 at 19:11
  • You're right! I did some tests and it seems your method is SO fast that it actually finished in 5 nanoseconds, and UUID1 cannot generate that fast. UUID4 worked perfectly though. One note for your answer: I also need to update the df3.id with a random UUID (differing from the ones matching df1 and df2). I figured it out, just thought I'd mention it for a possible edit to help future onlookers. – Jack Avante Apr 24 '21 at 19:43
  • Glad to help :) – PieCot Apr 24 '21 at 19:44
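
For anyone comparing the two generators discussed in these comments, a quick illustration of the difference (uuid1 encodes a timestamp, clock sequence and the host node, so back-to-back values share most of their fields; uuid4 is purely random):

import uuid

# uuid1: consecutive calls differ mainly in the timestamp fields
print(uuid.uuid1())
print(uuid.uuid1())

# uuid4: 122 random bits, no dependence on host or time
print(uuid.uuid4())
print(uuid.uuid4())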