(update: added desired data frame)
Let me start by saying that I'm reasonably confident that I found a solution to this problem several years ago, but I have not been able to re-find that solution.
Questions that address similar problems, but don't solve my particular problem include:
- Efficiently select rows that match one of several values in Pandas DataFrame
- Efficiently adding calculated rows based on index values to a pandas DataFrame
- Compare Python Pandas DataFrames for matching rows
The Question
Let's say I have a data frame with many columns that I am working on:
big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
big
match_1
0 11
1 12
2 51
3 52
I also have a smaller data frame that, in theory, maps some conditional statement to a desired value:
# A smaller dataframe that we use to map values into the larger dataframe
small = pd.DataFrame({'is_even': [True, False], 'score': [10, 200]})
small
is_even score
0 True 10
1 False 200
The goal here would be to use a conditional statement to match each row in `big` to a single row in `small`. Assume that `small` is constructed such that there will always be one, and only one, match for each row in `big`. (If multiple rows in `small` happen to match, just pick the first one.)
The desired output would be something like:
desired = pd.DataFrame({'match_1': [11, 12, 51, 52], 'metric': [200, 10, 200, 10]})
desired
match_1 metric
0 11 200
1 12 10
2 51 200
3 52 10
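To spell out the mapping implied above: rows of `big` with an even `match_1` get the `score` from the `is_even == True` row of `small` (10), and odd rows get the `is_even == False` score (200). As a quick sanity check, the desired frame can be reproduced directly (illustration only, not the general solution being asked for):

```python
import pandas as pd
import numpy as np

big = pd.DataFrame({'match_1': [11, 12, 51, 52]})

# Per the small frame above: even values map to 10, odd values to 200
desired = big.assign(metric=np.where(big['match_1'] % 2 == 0, 10, 200))
print(desired)
```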
I'm pretty sure that the syntax would look similar to:
big['score'] = small.loc[small['is_even'] == ((big['match_1'] % 2) == 0), 'score']
This won't work because `small['is_even']` is a Series of length 2, while `((big['match_1'] % 2) == 0)` is a Series of length 4. What I'm looking to do is, for each row in `big`, find the one row in `small` that matches based on a conditional.
If I can get a sequence that contains, for each row in `big`, the matching row in `small`, then I could do something like:

big['score'] = small.loc[matching_rows, 'score'].values

The question I have is: how do I generate the sequence `matching_rows`?
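One way `matching_rows` could be built for the parity example is to broadcast the condition into a `len(big)` x `len(small)` boolean matrix and take the first `True` column per row. This is a sketch, not necessarily the remembered solution, and the parity comparison would need to be swapped out for whatever condition applies:

```python
import pandas as pd
import numpy as np

big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
small = pd.DataFrame({'is_even': [True, False], 'score': [10, 200]})

# len(big) x len(small) matrix: True where big row i satisfies small row j's condition
cond = (big['match_1'].to_numpy()[:, None] % 2 == 0) == small['is_even'].to_numpy()[None, :]

# argmax along each row returns the index of the first True
# (safe here because the question guarantees at least one match per row)
matching_rows = cond.argmax(axis=1)
big['score'] = small['score'].to_numpy()[matching_rows]
```

Note this materializes the full pairwise matrix, so memory grows with `len(big) * len(small)`.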
Things that (I think) aren't quite what I want:
If the columns in `big` and `small` were to match simply on constant values, this would be a straightforward use of either `big.merge()` or `big.groupby()`; however, in my case, the mapping can be an arbitrarily complex boolean conditional, for example:
(big['val1'] > small['threshold']) & (big['val2'] == small['val2']) & (big['val3'] > small['min_val']) & (big['val3'] < small['max_val'])
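Even a condition like that can be evaluated pairwise via a cross join: pair every row of `big` with every row of `small`, filter on the condition, and keep the first match per `big` row. A sketch, using hypothetical data since these columns aren't defined in the question (and `how='cross'` requires pandas >= 1.2):

```python
import pandas as pd

# Hypothetical data for illustration only; column names follow the example condition above
big = pd.DataFrame({'val1': [5, 20], 'val2': ['a', 'b'], 'val3': [3, 15]})
small = pd.DataFrame({'threshold': [1, 10], 'val2': ['a', 'b'],
                      'min_val': [0, 10], 'max_val': [10, 20], 'score': [100, 900]})

# Pair every big row with every small row, keeping the big-row index
# so we can pick the first match per row afterwards
pairs = big.reset_index().merge(small, how='cross')

# Evaluate the arbitrarily complex conditional row-wise on the pairs
# (overlapping 'val2' gets suffixed to val2_x / val2_y by merge)
mask = ((pairs['val1'] > pairs['threshold']) & (pairs['val2_x'] == pairs['val2_y'])
        & (pairs['val3'] > pairs['min_val']) & (pairs['val3'] < pairs['max_val']))

# First matching small row per big row
matched = pairs[mask].groupby('index').first()
```

The cross join is O(len(big) * len(small)) in memory, so this only scales so far, but it handles any condition expressible on the paired columns.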
Solutions that rely on `isin()`, `any()`, etc., don't work, because the conditional check can be arbitrarily complex. I could certainly create a function to `apply()` to the bigger DataFrame, but again, I'm pretty sure there was a simpler solution.
The answer may come down to 'calculate some intermediate columns until you can do a simple merge' or 'just use `apply()`', but I could swear that there was a way to do what I've described above.
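For the record, the 'intermediate column then merge' route applied to the parity example would look something like this (a sketch, not necessarily the misplaced solution):

```python
import pandas as pd

big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
small = pd.DataFrame({'is_even': [True, False], 'score': [10, 200]})

# Materialize the condition as a key column, then an ordinary merge does the mapping;
# how='left' preserves big's row order
result = big.assign(is_even=big['match_1'] % 2 == 0).merge(small, on='is_even', how='left')
```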