
The title is a bit confusing, but I'll do my best to explain my problem here. I have two pandas DataFrames, a and b:

>> print a

id | value
 1 | 250
 2 | 150
 3 | 350
 4 | 550
 5 | 450

>> print b

low | high | class
100 | 200  | 'A' 
200 | 300  | 'B' 
300 | 500  | 'A' 
500 | 600  | 'C' 

I want to create a new column called class in table a that contains the class of each value according to table b. Here's the result I want:

>> print a

id | value | class
 1 | 250   | 'B'
 2 | 150   | 'A'
 3 | 350   | 'A'
 4 | 550   | 'C'
 5 | 450   | 'A'
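
For reference, the two example frames can be built like this (just a minimal reconstruction of the tables shown above):

import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                  'value': [250, 150, 350, 550, 450]})
b = pd.DataFrame({'low': [100, 200, 300, 500],
                  'high': [200, 300, 500, 600],
                  'class': ['A', 'B', 'A', 'C']})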

I have the following code written that sort of does what I want:

a['class'] = pd.Series()
for i in range(len(a)):
    val = a['value'][i]
    cl = b['class'][(b['low'] <= val) &
                    (b['high'] >= val)].iat[0]
    a['class'].set_value(i, cl)

The problem is that this is quick for tables of length 10 or so, but I am trying to do this with tables of 100,000+ rows for both a and b. Is there a quicker way to do this, using some function/attribute in pandas?

rbae

2 Answers


Here is a way to do a range join inspired by @piRSquared's solution:

import numpy as np
import pandas as pd

A = a['value'].values
bh = b.high.values
bl = b.low.values

# Pair every value in a with every (low, high) interval in b and keep
# the (row-in-a, row-in-b) index pairs where the value falls inside.
i, j = np.where((A[:, None] >= bl) & (A[:, None] <= bh))

pd.DataFrame(
    np.column_stack([a.values[i], b.values[j]]),
    columns=a.columns.append(b.columns)
)

Output:

  id value  low high  class
0  1   250  200  300   'B' 
1  2   150  100  200   'A' 
2  3   350  300  500   'A' 
3  4   550  500  600   'C' 
4  5   450  300  500   'A'
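
Note that A[:, None] >= bl broadcasts to a len(a) x len(b) boolean array, so memory use grows with the product of the two table sizes. If you only want the class column back on a, one way to do it (reusing the i and j index arrays from above, and assuming every value falls inside exactly one interval of b) is:

# Attach only the matched class back onto a; assumes each value in a
# matches exactly one interval of b.
a_out = a.copy()
a_out['class'] = pd.Series(b['class'].values[j], index=a.index[i])
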
Scott Boston
  • ah, that's an excellent solution. I end up getting a MemoryError, however. piRSquared's solution involved tables of length 2M+, while mine is only around 100,000. I'll double-check for any bugs. – rbae Jul 28 '17 at 03:48
  • @Scott Boston, can you please look at the question I posted? Nobody has answered it yet, and I'll be glad if you help me. https://stackoverflow.com/questions/45511223/how-to-perform-single-operation-on-multiple-columns-of-dataframe – BlueQuant Aug 04 '17 at 17:34

Here's a solution that is admittedly less elegant than using Series.searchsorted, but it runs super fast!

I pull the data out of the pandas DataFrames, convert it to lists/arrays, and then use np.where to populate a variable called "aclass" wherever the conditions are satisfied (in brute-force for loops). Then I write "aclass" back to the original DataFrame a.

The evaluation time was 0.07489705 s, so it's pretty fast, even with 200,000 data points!

import time
import numpy as np

# create 200,000 fake a data points
avalue = 100 + 600*np.random.random(200000)  # assuming you extracted this from a with avalue = np.array(a['value'])

blow = [100, 200, 300, 500]    # assuming you extracted this from b with list(b['low'])
bhigh = [200, 300, 500, 600]   # assuming you extracted this from b with list(b['high'])
bclass = ['A', 'B', 'A', 'C']  # assuming you extracted this from b with list(b['class'])

aclass = [[]]*len(avalue)  # initialize aclass

start_time = time.time()  # this is just for timing the execution
for i in range(len(blow)):
    # indices of every value that falls inside the i-th (low, high) interval
    for j in np.where((avalue >= blow[i]) & (avalue <= bhigh[i]))[0]:
        aclass[j] = bclass[i]

# add the class column to the original a DataFrame
a['class'] = aclass

print("--- %s seconds ---" % np.round(time.time() - start_time, decimals=8))
Bow