How can I iterate over two dataframes to compare data and do processing?

Question

I have two different dataframes: A, B. The column Event has similar data that I'm using to compare the two dataframes. I want to give Dataframe A a new column, dfA.newContext#.

In order to do this, I'll need to use the Event column. I want to iterate through Dataframe A to find a match for Event and assign the dfB.context# to dfA.newContext#

I think a loop would be the best way since I have a few conditions that I need to check.

This might be asking a bit much but I'm really stuck.. I want to do something like this:

offset = 0
Iterate through dfA:
    extract event
    extract context#
        Iterate through dfB:
            if dfB.event == dfA.event:
                dfA.newContext# = dfB.context#
                offset = dfA.new_context# - dfA.context#
                if dfB.event == "Special":
                    dfA.newContext# = dfA.context# - offset

Dataframe A

+-------------+---------+------+
|dfA.context# |dfA.event| Name |
+-------------+---------+------+
| 0           | Special | Bob  |
| 2           | Special | Joan |
| 4           |    Bird | Susie|
| 5           | Special | Alice|
| 6           | Special | Tom  |
| 7           | Special | Luis |
| 8           |  Parrot | Jill |
| 9           | Special | Reed |
| 10          | Special | Lucas|
| 11          |   Snake | Kat  |
| 12          | Special | Bill |
| 13          | Special | Leo  |
| 14          | Special | Peter|
| 15          | Special | Mark |
| 16          | Special | Joe  |
| 17          | Special | Lora |
| 18          | Special | Care |
| 19          |Elephant | David|
| 20          | Special | Ann  |
| 21          | Special | Larry|
| 22          |   Skunk | Tony |
+-------------+---------+------+

Dataframe B

+-------------+---------+
|dfB.context# |dfB.event|
+-------------+---------+
| 0           | Special |
| 0           | Special |
| 0           | Special |
| 1           | Special |
| 1           | Special |
| 1           | Special |
| 1           | Special |
| 2           |    Bird |
| 2           |    Bird |
| 3           | Special |
| 6           |  Parrot |
| 6           |  Parrot |
| 6           |  Parrot |
| 6           |  Parrot |
| 7           | Special |
| 7           | Special |
| 9           |   Snake |
| 9           |   Snake |
| 9           |   Snake |
| 10          | Special |
| 17          |Elephant |
| 17          |Elephant |
| 17          |Elephant |
| 18          | Special |
| 18          | Special |
| 20          |  Skunk  |
| 20          |  Skunk  |
| 21          | Special |
| 26          | Antelope|
+-------------+---------+

Desired DF

+-------------+---------+------+-------------+
|dfA.context# |dfA.event| Name |dfA.newContext#|
+-------------+---------+------+-------------+
| 0           | Special | Bob  |           0 |
| 2           | Special | Joan |           1 |
| 4           |    Bird | Susie|           2 |
| 5           | Special | Alice|           3 |
| 6           | Special | Tom  |             |
| 7           | Special | Luis |             |
| 8           |  Parrot | Jill |           6 |
| 9           | Special | Reed |           7 |
| 10          | Special | Lucas|             |
| 11          |   Snake | Kat  |           9 |
| 12          | Special | Bill |          10 | 
| 13          | Special | Leo  |             |
| 14          | Special | Peter|             |
| 15          | Special | Mark |             |
| 16          | Special | Joe  |             |
| 17          | Special | Lora |             |
| 18          | Special | Care |             |
| 19          |Elephant | David|          17 |
| 20          | Special | Ann  |          18 |
| 21          | Special | Larry|             |
| 22          |   Skunk | Tony |          20 |
+-------------+---------+------+-------------+

How can I iterate through the two dataframes at once and access the information ?

https://pandas.pydata.org/docs/user_guide/basics.html#iteration — wwii, Aug 23 '20 at 03:11
I'm confused with how exactly to iterate over the two dataframes and to access the information in each of the row/columns. I think with pandas it's not very loop-friendly but I think it might be the best way to iterate through more explicitly to handle each of my cases. — hollyj, Aug 23 '20 at 03:24
If you have a dataframe `df`, you could traverse it row-wise as : `for i in range(len(df)):` and within the for loop access each element can be accessed using `df.loc[row, col]`. Also if you want to change some element, use `df.at[row, col] = new_value`. — Suraj, Aug 23 '20 at 03:38

David Erickson · Accepted Answer · 2020-08-23T04:25:57.767

95% of the time you can use a pandas vectorized method and eliminate the need for looping. In this case you can just use pd.merge in a simple, clean and efficient alternative to a long loop.

EDIT: (Answer #1): Actually, you can do a more advanced merge with left_on=dfA.index, right_on='context' and do this in one line with other clean up operations after the merge, but see a fuller answer below, which takes a similar approach:

df = (pd.merge(dfA, dfB['context'], how='left', left_on=dfA.index, right_on='context')
        .drop_duplicates()
        .dropna(subset=['Name'])
        .drop('context', axis=1)
        .rename({'context_x' : 'context', 'context_y' : 'newContext'}, axis=1).fillna(''))

Answer #2: You can just merge the two dataframes together after manipulating the two dataframes in preparation for the merge:

dfA - make the context column in dfA equal to the index, but before changing it, save it as a series s for later
dfB - drop duplicates, reset the index, and change the name of the index to newContext in preparation of the merge.
Merge on event and context and replace newContext values with context values where null.
Change context back to it's original data with df['context'] = s

s = dfA['context']
dfA['context'] = dfA.index.astype(str)
dfB = dfB.drop_duplicates().reset_index().rename({'index' :'newContext'}, axis=1).astype(str)
df = pd.merge(dfA, dfB, how='left', on=['event', 'context'])
df['newContext'] = df['newContext'].where(df['newContext'].isnull(), df['context']).fillna('')
df['context'] = s
df
Out[9]: 
    context     event   Name newContext
0         0   Special    Bob          0
1         2   Special   Joan          1
2         4      Bird  Susie          2
3         5   Special  Alice          3
4         6   Special    Tom           
5         7   Special   Luis           
6         8    Parrot   Jill          6
7         9   Special   Reed          7
8        10   Special  Lucas           
9        11     Snake    Kat          9
10       12   Special   Bill         10
11       13   Special    Leo           
12       14   Special  Peter           
13       15   Special   Mark           
14       16   Special    Joe           
15       17   Special   Lora           
16       18   Special   Care           
17       19  Elephant  David         17
18       20   Special    Ann         18
19       21   Special  Larry           
20       22     Skunk   Tony         20

Hi David, Thanks ! I was going to merge after preparing as well, but only on context, since there are many "Special" events in dfA (but I am only interested in the data for some of them). I want to preserve all rows in dfB - I basically just want to add row data from dfA to dfB. This solution will work for this set of data shown, but is there a way to traverse through and calculate the offset between the dfA context and dfA newContext ? I'll need to apply this to get the dfA newContext of some Special events. — hollyj, Aug 23 '20 at 04:30
@hollyj Iterative debugging is generally discouraged in Stack Overflow. If my solution answers the specific question (input/output) you have posted, please consider accepting and upvoting the answer. You can kindly post a fresh question with new input data and expected output data (where this solution wouldn't work), and where I or others can try to propose a solution. — David Erickson, Aug 23 '20 at 04:33
I agree that a long loop is not the most efficient way, but because there are some cases where the offset changes, I think it might be the best way.. for example, the data will change to an offset of 8 and I need to know which of the sequentially listed "Special" events and corresponding name to use. I will choose the correct "Special" event and name based on this offset of 8 that was used the dfA newContext - dfA context directly prior to this "Special" event. — hollyj, Aug 23 '20 at 04:56
Apologize if this is poor commenting etiquette, but for completeness, I added a new followup question here: [link](https://stackoverflow.com/questions/63543875/how-can-i-iterate-through-two-dataframes-and-assign-a-new-values-based-on-compar) — hollyj, Aug 23 '20 at 05:57
@hollyj no that is completely fine. I will look into tomorrow, but there is probably someone who will answer your question before then. — David Erickson, Aug 23 '20 at 05:58
Also @hollyj for “offset” check out .shift() You can offset a column by 2,-2, etc. with .shift(2) or .shift(-2) or any number. Default for .shift() is 1. I use shift ALL the time. If you look at a couple of my recent answers in my profile, in fact, I show how to use shift effectively. This is also a common alternative to looping. — David Erickson, Aug 23 '20 at 06:02
.shift() is a great solution for consistent offset! Thank you for this suggestion. I've looked into it though and I don't think it solves my issue as there are many different offsets that change throughout my dataframe.. — hollyj, Aug 23 '20 at 16:33

How can I iterate over two dataframes to compare data and do processing?

1 Answers1