
(update: added desired data frame)

Let me start by saying that I'm reasonably confident that I found a solution to this problem several years ago, but I have not been able to re-find that solution.

Questions that address similar problems, but don't solve my particular problem include:


The Question

Let's say I have a data frame with many columns that I am working on:

big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
big
   match_1
0       11
1       12
2       51
3       52

I also have a smaller data frame that, in theory, maps some conditional statement to a desired value:

# A smaller dataframe that we use to map values into the larger dataframe
small = pd.DataFrame({'is_even': [True, False], 'score': [10, 200]})
small
  is_even  score
0    True     10
1   False    200

The goal here would be to use a conditional statement to match each row in big to a single row in small. Assume that small is constructed such that there will always be one, and only one, match for each row in big. (If multiple rows in small happen to match, just pick the first one.)

The desired output would be something like:

desired = pd.DataFrame({'match_1': [11, 12, 51, 52], 'metric': [200, 10, 200, 10]})
desired
   match_1  metric
0       11     200
1       12      10
2       51     200
3       52      10

I'm pretty sure that the syntax would look similar to:

big['score'] = small.loc[small['is_even'] == ((big['match_1'] % 2) == 0), 'score']

This won't work because small['is_even'] is a Series of length 2, while ((big['match_1'] % 2) == 0) is a Series of length 4. What I'm looking to do is, for each row in big, find the one row in small that matches based on a conditional.

If I can get a sequence that contains the correct row in small that matches each row in big, then I could do something like:

big['score'] = small.loc[matching_rows, 'score']

The question I have is: how do I generate the sequence matching_rows?


Things that (I think) aren't quite what I want:

If the columns in big and small matched simply on constant values, this would be a straightforward use of either big.merge() or big.groupby(). In my case, however, the mapping can be an arbitrarily complex boolean conditional, for example:

(big['val1'] > small['threshold']) & (big['val2'] == small['val2']) & (big['val3'] > small['min_val']) & (big['val3'] < small['max_val'])

Solutions that rely on isin(), any(), etc. don't work, because the conditional check can be arbitrarily complex.
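One workaround for an arbitrarily complex conditional is a cross join: pair every row of big with every row of small, evaluate the condition on the product, and keep the first match per row. A minimal sketch using the is_even example above (how='cross' assumes pandas >= 1.2):

```python
import pandas as pd

big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
small = pd.DataFrame({'is_even': [True, False], 'score': [10, 200]})

# Pair every row of big with every row of small.
pairs = big.merge(small, how='cross')

# Evaluate the (arbitrarily complex) condition on the product and keep matches.
matched = pairs[pairs['is_even'] == (pairs['match_1'] % 2 == 0)]

# Keep the first match per row of big.
result = matched.drop_duplicates(subset='match_1')[['match_1', 'score']]
```

This scales as len(big) * len(small) rows, so it is only reasonable when small is genuinely small.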

I could certainly create a function to apply() to the bigger DataFrame, but again, I'm pretty sure there was a simpler solution.
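For reference, that apply() fallback might look something like the following sketch, again using the is_even example (lookup is a hypothetical helper, not an existing function):

```python
import pandas as pd

big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
small = pd.DataFrame({'is_even': [True, False], 'score': [10, 200]})

def lookup(row):
    # Select the rows of small whose condition matches this row of big,
    # then take the first match.
    mask = small['is_even'] == (row['match_1'] % 2 == 0)
    return small.loc[mask, 'score'].iloc[0]

big['score'] = big.apply(lookup, axis=1)
```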


The answer may come down to 'calculate some intermediate columns until you can do a simple merge' or 'just use apply()', but I could swear that there was a way to do what I've described above.

Zachary Cross
  • Do the dataframes derive from a database? If so, your conditional matching (one-to-many) can be handled in SQL; no need for pandas to restructure in memory. – Parfait Nov 17 '15 at 20:24

2 Answers


One approach is to use a merge in which left_on is not a column name, but a vector of keys. It's made simpler by setting the index of small to be is_even:

>>> small.set_index('is_even', inplace=True)
>>> condition = big['match_1'] % 2 == 0
>>> pd.merge(big, small, left_on=condition, right_index=True, how='left')
   match_1  score
0       11    200
1       12     10
2       51    200
3       52     10
jme

You can index small with True and False and just do a straight label lookup on it (.ix has since been removed from pandas; Series.map does the equivalent label lookup). Not sure it's all that much tidier than the intermediate column/merge:

In [127]: big = pd.DataFrame({'match_1': [11, 12, 51, 52]})

In [128]: small = pd.DataFrame({'score': [10, 200]}, index=[True, False])

In [129]: big['score'] = (big.match_1 % 2 == 0).map(small['score'])

In [130]: big
Out[130]:
   match_1  score
0       11    200
1       12     10
2       51    200
3       52     10
Randy