
Consider dataframes df1, df2, and df3.

df1 and df2 have an id column, and df3 has a from_id and to_id column.

I need to iterate over all rows of df3, replace from_id and to_id with newly generated unique UUIDs, and then update the corresponding rows in df1 and df2 where (id == from_id) | (id == to_id), matching against the original ids before the change to UUID.

I originally wanted to iterate over all rows of df3 and simply check whether df1 and df2 contain the original from_id or to_id in their id column before replacing both, but I found that iterating over pandas rows is a bad idea and slow.

I'm not sure how to apply the other methods mentioned in that post to this problem, since I'm not applying a simple function or calculating anything, and I think the way I had intended to do it would be too slow for big dataframes.

My current method, which I believe to be slow and inefficient:

import pandas as pd
import uuid
from typing import Tuple

def rand_uuid():
    return uuid.uuid1()

def update_ids(df_places: pd.DataFrame, df_transitions: pd.DataFrame, df_arcs: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    for i in range(len(df_arcs)):
        new_uuid_from = rand_uuid()
        new_uuid_to = rand_uuid()
        new_uuid_arc = rand_uuid()

        # propagate the new ids to every row whose id matches the arc's endpoints
        df_transitions.loc[df_transitions.id == df_arcs.iloc[i]['sourceId'], 'id'] = new_uuid_from
        df_transitions.loc[df_transitions.id == df_arcs.iloc[i]['destinationId'], 'id'] = new_uuid_to

        df_places.loc[df_places.id == df_arcs.iloc[i]['sourceId'], 'id'] = new_uuid_from
        df_places.loc[df_places.id == df_arcs.iloc[i]['destinationId'], 'id'] = new_uuid_to

        # assign via .loc with the row label; chained iloc assignment only modifies a copy
        df_arcs.loc[df_arcs.index[i], 'sourceId'] = new_uuid_from
        df_arcs.loc[df_arcs.index[i], 'destinationId'] = new_uuid_to
        df_arcs.loc[df_arcs.index[i], 'id'] = new_uuid_arc

    return df_places, df_transitions, df_arcs

Here df_places and df_transitions are the above-mentioned df1 and df2, df_arcs is df3, and sourceId/destinationId correspond to from_id/to_id.

Example df_places

+---+----+
|   | id |
+---+----+
| 1 | a1 |
+---+----+
| 2 | c1 |
+---+----+

Example df_transitions:

+---+----+
|   | id |
+---+----+
| 1 | b1 |
+---+----+

Example df_arcs:

+---+----------+---------------+
|   | sourceId | destinationId |
+---+----------+---------------+
| 1 | a1       | b1            |
+---+----------+---------------+
| 2 | b1       | c1            |
+---+----------+---------------+
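
For reference, a minimal construction of these example frames (the df_arcs `id` values below are placeholders, since the example table does not show that column):

import pandas as pd

df_places = pd.DataFrame({'id': ['a1', 'c1']})
df_transitions = pd.DataFrame({'id': ['b1']})
df_arcs = pd.DataFrame({
    'id': ['arc1', 'arc2'],          # placeholder arc ids, not shown in the table above
    'sourceId': ['a1', 'b1'],
    'destinationId': ['b1', 'c1'],
})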
Jack Avante
  • You can create a mapping dict of old: new and then use the map method. – Eric Truett Apr 24 '21 at 13:37
  • Please show how `rand_uuid()` derive (i.e., all `import` and/or `def` lines). If you can generate a numpy array or pandas Series of UUIDs instead of one by one you can vectorize your operation and assign to other dfs using `merge`. – Parfait Apr 24 '21 at 14:16
  • I added the method and imports, as well as example data – Jack Avante Apr 24 '21 at 14:28
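
The mapping-dict approach from the first comment could look roughly like this; a sketch only, assuming df1, df2 and df3 as defined in the question, with columns `id`, `sourceId` and `destinationId`:

import uuid
import pandas as pd

# one shared mapping of old id -> new UUID, built once instead of per row
id_map = {old: uuid.uuid4() for old in pd.concat([df1['id'], df2['id']]).unique()}

# map() is vectorized; fillna() keeps any value that has no entry in the mapping
for df, cols in ((df1, ['id']), (df2, ['id']), (df3, ['sourceId', 'destinationId'])):
    for col in cols:
        df[col] = df[col].map(id_map).fillna(df[col])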

1 Answer


A very simple approach:

import itertools
import uuid

def rand_uuid():
    return uuid.uuid4()

# one mapping of old id -> new UUID, covering every id in df1 and df2
rep_dict = {i: rand_uuid() for i in itertools.chain(df1.id, df2.id)}

# rewrite the endpoint ids in df3, then give each arc its own fresh UUID
df3.replace(rep_dict, inplace=True)
df3.id = df3.id.map(lambda x: rand_uuid())

# apply the same mapping to df1 and df2 so all three frames stay consistent
df1.replace(rep_dict, inplace=True)
df2.replace(rep_dict, inplace=True)
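
As a quick sanity check on the example data from the question (a sketch only; the concrete UUID values will differ on every run):

import itertools
import uuid
import pandas as pd

df1 = pd.DataFrame({'id': ['a1', 'c1']})                 # df_places
df2 = pd.DataFrame({'id': ['b1']})                       # df_transitions
df3 = pd.DataFrame({'id': ['arc1', 'arc2'],              # placeholder arc ids
                    'sourceId': ['a1', 'b1'],
                    'destinationId': ['b1', 'c1']})

rep_dict = {i: uuid.uuid4() for i in itertools.chain(df1.id, df2.id)}

df3.replace(rep_dict, inplace=True)
df3.id = df3.id.map(lambda x: uuid.uuid4())
df1.replace(rep_dict, inplace=True)
df2.replace(rep_dict, inplace=True)

# every sourceId/destinationId in df3 should now be one of the new ids in df1 or df2
new_ids = set(df1.id) | set(df2.id)
assert set(df3.sourceId) <= new_ids and set(df3.destinationId) <= new_ids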
PieCot
  • I checked the generated dictionary `rep_dict`, and it seems to have generated the same UUIDs for all the entries. Is this because it happened too fast and UUID uses seconds, or something? – Jack Avante Apr 24 '21 at 18:52
  • 1
    Inspect the generated IDs, they should be very similar, but different in the end. BTW, I'd use `uuid4`(https://docs.python.org/3/library/uuid.html#uuid.uuid4) rather than `uuid1` to generate random unique identifiers not dependent on the host ID, the sequence number, and the time. I've update the answer. Give it a try! – PieCot Apr 24 '21 at 19:11
  • You're right! I did some tests and it seems your method is SO fast that it actually finished in 5 nanoseconds, and UUID1 cannot generate that fast. UUID4 worked perfectly though. One note for your answer: I also need to update the df3.id with a random UUID (differing from the ones matching df1 and df2). I figured it out, just thought I'd mention it for a possible edit to help future onlookers. – Jack Avante Apr 24 '21 at 19:43
  • Glad to help :) – PieCot Apr 24 '21 at 19:44
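
For anyone comparing the two generators discussed in these comments, a quick illustration of the difference (uuid1 encodes a timestamp, clock sequence and the host node, so back-to-back values share most of their fields; uuid4 is purely random):

import uuid

# uuid1: consecutive calls differ mainly in the timestamp fields
print(uuid.uuid1())
print(uuid.uuid1())

# uuid4: 122 random bits, no dependence on host or time
print(uuid.uuid4())
print(uuid.uuid4())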