0

I have two dataframes. One, named population has two columns randomly ordered positions. The other, named keyFrame, has two columns of ordered keys and a column of attributes ('attr') associated with the pair of keys.

I use the below code to:

  1. Create an empty column in population.
  2. Iterate over each row in keyFrame (the iterable dataframe is not being altered).
  3. Assign the rows 'attr' value to populations 'assignment' where either position1 == key1 & position2 == key2 OR where position1 == key2 & position2 == key1.

This works perfectly, but is extremely slow in my actual code. the population dataframe is >500k rows in actuality and the keyFrame dataframe has >1500 values.

Question: Is there a way to assign the 'attr' values from keyFrame to population where the keys match (interchangeably) all at once?

# Sample code for you to test! Thank you!    
import pandas as pd
import numpy as np

population = pd.DataFrame(data={'position1': [1, 6, 1, 1, 1, 7, 1, 8, 16],
                                'position2': [5, 1, 15, 9, 17, 1, 2, 1, 1]})
keyFrame = pd.DataFrame(data={'key1': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                              'key2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17],
                              'attr': [0.79, 0.65, 0.99, 0.03, 0.58, 0.19, 0.53,
                                       0.76, 0.49, 0.46, 0.25, 0.11, 0.22, 0.38, 0.94]})

population['assignment'] = np.NaN  # Step 1
for index, row in keyFrame.iterrows():  # Step 2
    # Step 3
    population['assignment'].loc[((population['position1'] == row['key1']) & (
                population['position2'] == row['key2'])) | (
                (population['position1'] == row['key2']) & (
                population['position2'] == row['key1']))] = row['attr']

P.S. I am aware many questions exist that are similar to this, but they either don't fully match my use case or they don't solve the issue in a more efficient manner.

FINAL: Thanks to all the great suggestions! These all worked and were much faster than my original implementation!!

In terms of speed the results were as follows:

  1. BeRT2me's method: 18.45s
  2. Jānis Š's method: 21.34s
  3. ouroboros1's method: 26.96s

I must hazard for anyone who comes across these solutions though, they are sensitive to index values. Make sure to reset all indices for the population and keyFrame dataframes.

Paul
  • 165
  • 1
  • 12
  • 1
    Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – Chris Sep 27 '22 at 15:54
  • 1
    Pretty much *Anything* will be more efficient than `iterrows`. DataFrames are not meant to be iterated over for tasks like this... – BeRT2me Sep 27 '22 at 16:47

3 Answers3

1

One approach could be as follows:

population['assignment'] = population[['position1','position2']]\
    .apply(sorted, axis=1, result_type='expand')\
        .merge(keyFrame, left_on=[0,1], right_on=['key1','key2'], how='left')\
            ['attr']

   position1  position2  assignment
0          1          5        0.03
1          6          1        0.58
2          1         15        0.22
3          1          9        0.76
4          1         17        0.94
5          7          1        0.19
6          1          2        0.79
7          8          1        0.53
8         16          1        0.38

Explanation

  • First, we use .apply(sorted, axis=1, result_type='expand') to cols position1, position2 to get two cols (default names: [0,1]) with the values sorted for each row. E.g. [1,5], [1,6] etc.
  • Sorted, we can use df.merge with left_on=[0,1] and right_on=['key1','key2'], select only attr, and assign it to a new column for the df population.
ouroboros1
  • 9,113
  • 3
  • 7
  • 26
1

Maybe like this (almost the same as @ouroboros1 proposed):

population['key'] = [tuple(sorted([p1, p2])) for p1, p2 in zip(population.position1, population.position2)]
keyFrame['key'] = [tuple(sorted([k1, k2])) for k1, k2 in zip(keyFrame.key1, keyFrame.key2)]
population['assignment'] = population.merge(keyFrame[['key', 'attr']], on='key')['attr']
population.drop(columns=['key'], inplace=True)

Result:

   position1  position2  assignment
0          1          5        0.03
1          6          1        0.58
2          1         15        0.22
3          1          9        0.76
4          1         17        0.94
5          7          1        0.19
6          1          2        0.79
7          8          1        0.53
8         16          1        0.38
Jānis Š.
  • 532
  • 3
  • 14
1

One method would be to merge both ways and then combine the results.

# smol var names make me happy.
p_df = population
k_df = keyFrame

df1 = p_df.merge(k_df, left_on=['position1', 'position2'], right_on=['key1', 'key2'])
df2 = p_df.merge(k_df, left_on=['position2', 'position1'], right_on=['key1', 'key2'])
df = pd.concat([df1, df2], ignore_index=True)
print(df)

Output:

   position1  position2  key1  key2  attr
0          1          5     1     5  0.03
1          1         15     1    15  0.22
2          1          9     1     9  0.76
3          1         17     1    17  0.94
4          1          2     1     2  0.79
5          6          1     1     6  0.58
6          7          1     1     7  0.19
7          8          1     1     8  0.53
8         16          1     1    16  0.38

If we want to pre-sort things...

p_cols = ['position1', 'position2']
p_df[p_cols] = np.sort(p_df[p_cols], axis=1)

k_cols = ['key1', 'key2']
k_df[k_cols] = np.sort(k_df[k_cols], axis=1)

df = p_df.merge(k_df, left_on=['position1', 'position2'], right_on=['key1', 'key2'])
print(df)

(Same Output as above)

BeRT2me
  • 12,699
  • 2
  • 13
  • 31