
The title is a bit confusing, but I'll do my best to explain my problem here. I have two pandas DataFrames, a and b:

>> print a

id | value
 1 | 250
 2 | 150
 3 | 350
 4 | 550
 5 | 450

>> print b

low | high | class
100 | 200  | 'A' 
200 | 300  | 'B' 
300 | 500  | 'A' 
500 | 600  | 'C' 

I want to create a new column called class in table a that contains the class of each value according to table b. Here's the result I want:

>> print a

id | value | class
 1 | 250   | 'B'
 2 | 150   | 'A'
 3 | 350   | 'A'
 4 | 550   | 'C'
 5 | 450   | 'A'
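
For reference, the two example frames can be built like this (just a minimal reconstruction of the tables shown above):

import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                  'value': [250, 150, 350, 550, 450]})
b = pd.DataFrame({'low': [100, 200, 300, 500],
                  'high': [200, 300, 500, 600],
                  'class': ['A', 'B', 'A', 'C']})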

I have the following code written that sort of does what I want:

a['class'] = pd.Series()
for i in range(len(a)):
    val = a['value'][i]
    cl = b['class'][(b['low'] <= val) &
                    (b['high'] >= val)].iat[0]
    a['class'].set_value(i, cl)

The problem is that this is quick for tables of length 10 or so, but I am trying to do this with tables of 100,000+ rows for both a and b. Is there a quicker way to do this, using some function/attribute in pandas?

rbae

2 Answers


Here is a way to do a range join inspired by @piRSquared's solution:

import numpy as np
import pandas as pd

A = a['value'].values
bh = b.high.values
bl = b.low.values

# Pair every value in a with every (low, high) interval in b and keep
# the (row-in-a, row-in-b) index pairs where the value falls inside.
i, j = np.where((A[:, None] >= bl) & (A[:, None] <= bh))

pd.DataFrame(
    np.column_stack([a.values[i], b.values[j]]),
    columns=a.columns.append(b.columns)
)

Output:

  id value  low high  class
0  1   250  200  300   'B' 
1  2   150  100  200   'A' 
2  3   350  300  500   'A' 
3  4   550  500  600   'C' 
4  5   450  300  500   'A'
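
Note that A[:, None] >= bl broadcasts to a len(a) x len(b) boolean array, so memory use grows with the product of the two table sizes. If you only want the class column back on a, one way to do it (reusing the i and j index arrays from above, and assuming every value falls inside exactly one interval of b) is:

# Attach only the matched class back onto a; assumes each value in a
# matches exactly one interval of b.
a_out = a.copy()
a_out['class'] = pd.Series(b['class'].values[j], index=a.index[i])
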
Scott Boston
  • ah, that's an excellent solution. I end up getting a MemoryError, however. piRSquared's solution involved tables of length 2M+, while mine is only around 100,000. I'll double-check for any bugs. – rbae Jul 28 '17 at 03:48
  • @Scott Boston, can you please look at the question I posted? Nobody has answered it yet, and I'll be glad if you help me. https://stackoverflow.com/questions/45511223/how-to-perform-single-operation-on-multiple-columns-of-dataframe – BlueQuant Aug 04 '17 at 17:34

Here's a solution that is admittedly less elegant than using Series.searchsorted, but it runs super fast!

I pull the data out of the pandas DataFrames, convert it to lists/arrays, and then use np.where to populate a variable called "aclass" wherever the conditions are satisfied (in brute-force for loops). Then I write "aclass" back to the original DataFrame a.

The evaluation time was 0.07489705 s, so it's pretty fast, even with 200,000 data points!

import time
import numpy as np

# create 200,000 fake a data points
avalue = 100 + 600*np.random.random(200000)  # assuming you extracted this from a with avalue = np.array(a['value'])

blow = [100, 200, 300, 500]    # assuming you extracted this from b with list(b['low'])
bhigh = [200, 300, 500, 600]   # assuming you extracted this from b with list(b['high'])
bclass = ['A', 'B', 'A', 'C']  # assuming you extracted this from b with list(b['class'])

aclass = [[]]*len(avalue)  # initialize aclass

start_time = time.time()  # this is just for timing the execution
for i in range(len(blow)):
    # indices of every value that falls inside the i-th (low, high) interval
    for j in np.where((avalue >= blow[i]) & (avalue <= bhigh[i]))[0]:
        aclass[j] = bclass[i]

# add the class column to the original a DataFrame
a['class'] = aclass

print("--- %s seconds ---" % np.round(time.time() - start_time, decimals=8))
Bow