I have two dataframes. One, named population, has two columns of randomly ordered positions. The other, named keyFrame, has two columns of ordered keys and a column of attributes ('attr') associated with each pair of keys.
I use the below code to:
- Create an empty column in population.
- Iterate over each row in keyFrame (the iterable dataframe is not being altered).
- Assign the row's 'attr' value to population's 'assignment' where either position1 == key1 & position2 == key2 OR where position1 == key2 & position2 == key1.
This works perfectly, but it is extremely slow in my actual code. The real population dataframe has >500k rows and the keyFrame dataframe has >1500 values.
Question: Is there a way to assign the 'attr' values from keyFrame to population where the keys match (interchangeably) all at once?
# Sample code for you to test! Thank you!
import pandas as pd
import numpy as np
population = pd.DataFrame(data={'position1': [1, 6, 1, 1, 1, 7, 1, 8, 16],
                                'position2': [5, 1, 15, 9, 17, 1, 2, 1, 1]})
keyFrame = pd.DataFrame(data={'key1': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                              'key2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17],
                              'attr': [0.79, 0.65, 0.99, 0.03, 0.58, 0.19, 0.53,
                                       0.76, 0.49, 0.46, 0.25, 0.11, 0.22, 0.38, 0.94]})
population['assignment'] = np.nan  # Step 1 (np.NaN was removed in NumPy 2.0)
for index, row in keyFrame.iterrows():  # Step 2
    # Step 3: match the key pair in either order, then assign through a
    # single .loc call instead of chained indexing
    mask = ((population['position1'] == row['key1']) & (
        population['position2'] == row['key2'])) | (
        (population['position1'] == row['key2']) & (
            population['position2'] == row['key1']))
    population.loc[mask, 'assignment'] = row['attr']
P.S. I am aware many questions exist that are similar to this, but they either don't fully match my use case or they don't solve the issue in a more efficient manner.
FINAL: Thanks to all the great suggestions! These all worked and were much faster than my original implementation!!
In terms of speed the results were as follows:
- BeRT2me's method: 18.45s
- Jānis Š's method: 21.34s
- ouroboros1's method: 26.96s
A word of caution for anyone who comes across these solutions, though: they are sensitive to index values. Make sure to reset the indices of both the population and keyFrame dataframes.
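For anyone wanting a concrete starting point, here is a minimal sketch of one vectorized way to do the lookup on the sample data. It is not necessarily the approach taken by any of the answers named above; it simply sorts each pair so that key order no longer matters and then does a single left merge. The names pop_pairs, key_pairs, left, right, merged, lo and hi are made up for illustration. It assumes that, when a key pair appears more than once in keyFrame, the last one should win, which mirrors what the original loop does.

```python
import numpy as np
import pandas as pd

population = pd.DataFrame({
    'position1': [1, 6, 1, 1, 1, 7, 1, 8, 16],
    'position2': [5, 1, 15, 9, 17, 1, 2, 1, 1],
})
keyFrame = pd.DataFrame({
    'key1': [1] * 15,
    'key2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17],
    'attr': [0.79, 0.65, 0.99, 0.03, 0.58, 0.19, 0.53,
             0.76, 0.49, 0.46, 0.25, 0.11, 0.22, 0.38, 0.94],
})

# Reset both indices first, as cautioned above (a no-op on this sample data).
population = population.reset_index(drop=True)
keyFrame = keyFrame.reset_index(drop=True)

# Sort each pair row-wise so that (6, 1) and (1, 6) both become (1, 6).
pop_pairs = np.sort(population[['position1', 'position2']].to_numpy(), axis=1)
key_pairs = np.sort(keyFrame[['key1', 'key2']].to_numpy(), axis=1)

left = pd.DataFrame(pop_pairs, columns=['lo', 'hi'])
right = pd.DataFrame(key_pairs, columns=['lo', 'hi'])
right['attr'] = keyFrame['attr'].to_numpy()

# Assumption: on duplicate key pairs the last row wins, like the loop.
right = right.drop_duplicates(['lo', 'hi'], keep='last')

# A left merge keeps one output row per population row, in the same order.
merged = left.merge(right, on=['lo', 'hi'], how='left')

# Assign by position rather than by label to sidestep index alignment.
population['assignment'] = merged['attr'].to_numpy()
print(population)
```

Rows in population with no matching pair in keyFrame end up as NaN, the same as in the loop version. Going through to_numpy() for the final assignment is a deliberate choice here: it avoids relying on the two frames having matching index labels.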