Increase performance when setting Pandas column

Question

Is there a way to improve this code written in Python? I use the Pandas library with Python 3.4:

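import bisect
import pandas as pd

bd_data = pd.DataFrame(list(bd_data))
column = list(bd_data[numeric])
for i in range(0, len(column)):
    # Find which interval the value falls into, then look up that interval's color
    pos = bisect.bisect_left(intervalsArray, int(column[i]))
    bd_data.ix[i, 'colorCluster'] = colorsPalette[pos]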

I'm trying to assign a color in colorCluster from colorsPalette based on the position of a number in a list of intervals. It takes about 6 seconds to process 16000 rows, which is way too much. I think I'm not using Pandas the way it is intended, especially here:

bd_data.ix[i,'colorCluster']

I'm actually doing this in R (with rpy2) with this line of code in less than a second:

dataToAnalyse$colorCluster <- colorsPalette[findInterval(dataToAnalyse$numeric, intervals)+1]

I'm sure there is a way to increase performance in Python, since many people say processing is usually (though not always) faster in Python than in R. Also, please suggest a better title for the question, as I'm not fluent in Pandas terminology.

jezrael · Accepted Answer · 2017-04-12T05:16:06.250

2

You can change:

bd_data.ix[i,'colorCluster'] = colorsPalette[pos]

to DataFrame.set_value:

bd_data.set_value(i, 'colorCluster', colorsPalette[pos])
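A note for readers on a recent pandas: as far as I know, both `.ix` and `DataFrame.set_value` were deprecated and then removed in pandas 1.0, and `.at` is the scalar accessor that took over from `set_value`. Below is a minimal sketch of the same loop using `.at`. The sample values for `intervalsArray`, `colorsPalette` and the `numeric` column are made up for illustration (the question does not show the real data), and a default integer index is assumed:

```python
import bisect

import pandas as pd

# Hypothetical sample data; the question does not show the real values.
intervalsArray = [10, 20, 30]
colorsPalette = ["#ffffcc", "#a1dab4", "#41b6c4", "#225ea8"]
bd_data = pd.DataFrame({"numeric": [5, 10, 15, 25, 99]})

# Create the column up front so .at only overwrites existing cells.
bd_data["colorCluster"] = ""

column = list(bd_data["numeric"])
for i in range(len(column)):
    pos = bisect.bisect_left(intervalsArray, int(column[i]))
    # .at is label-based; i works as a label here because of the default integer index.
    bd_data.at[i, "colorCluster"] = colorsPalette[pos]
```

If the frame has a non-default index, `.iat` with the column's integer position (or iterating over `bd_data.index`) would probably be the safer choice.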

edited Apr 12 '17 at 05:16

answered Apr 12 '17 at 05:14

jezrael

I'll try and tell you how it went – AFP_555 Apr 12 '17 at 05:15
That did it! It's even faster than R. Thanks. – AFP_555 Apr 12 '17 at 05:17
I know this isn't a part of the question, but could you also teach me if there is any way to program the same with less code? – AFP_555 Apr 12 '17 at 05:18
Hmm, I see your code uses loops, and the problem may be `bisect.bisect_left`: if it works only with scalars, loops are needed. Your code can perhaps be replaced by pandas function(s), but that needs more explanation; best is to add a data sample with the desired output. – jezrael Apr 12 '17 at 05:21

Impuls3H · Answer 2 · 2017-04-12T08:10:37.157

0

I'm quite new to Python myself, but would a list comprehension be faster in this case?

bd_data['colorCluster'] = [colorsPalette[bisect.bisect_left(intervalsArray,int(column_iter))] for column_iter in column]

Edit: Will an apply be faster instead?

bd_data['colorCluster'] = bd_data[numeric].apply(lambda x: colorsPalette[bisect.bisect_left(intervalsArray, int(x))])
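Another option that might be worth timing is to skip the per-row Python calls entirely. This is only a sketch under assumptions, not something tested on the asker's data: `np.searchsorted` with `side="left"` does the same lookup as `bisect.bisect_left` but for a whole array at once, and indexing a NumPy array of colors with the resulting positions fills the column in one step. The sample values below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the question does not show the real values.
intervalsArray = [10, 20, 30]
colorsPalette = ["#ffffcc", "#a1dab4", "#41b6c4", "#225ea8"]
bd_data = pd.DataFrame({"numeric": [5, 10, 15, 25, 99]})

# Convert once, mirroring the int(...) call in the original loop.
values = bd_data["numeric"].astype(int).to_numpy()

# side="left" matches bisect.bisect_left; the result is one position per row.
positions = np.searchsorted(intervalsArray, values, side="left")

# Fancy indexing picks the color for every row in a single operation.
bd_data["colorCluster"] = np.asarray(colorsPalette)[positions]
```

One caveat: R's `findInterval` counts the breakpoints that are less than or equal to each value by default, so if the goal is to match the R line exactly at the interval boundaries, `side="right"` may be the closer equivalent.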

edited Apr 12 '17 at 08:10

answered Apr 12 '17 at 06:15

Impuls3H

I'm quite curious myself, so can you kindly test it out and let me know if it's faster? Thank you! – Impuls3H Apr 12 '17 at 06:16
Ok, tried it. It is actually faster than what I did, but not faster than the selected answer. It took about 2 seconds. Keep in mind I'm not measuring this correctly, just counting in my head. – AFP_555 Apr 12 '17 at 07:01
Oh, that's interesting. I always had the impression that list comprehensions are marginally faster than loops. It seems that in this case the list comprehension is slower, perhaps due to the overhead of creating and extending the list. There's a good explanation of when to use what [in this SO question](http://stackoverflow.com/questions/22108488/are-list-comprehensions-and-functional-functions-faster-than-for-loops). Anyway, you can use [timeit](https://docs.python.org/2/library/timeit.html) for an accurate comparison of runtimes. – Impuls3H Apr 12 '17 at 07:49
I added a new option using an apply. Will that be faster? Let me know of the results! I'm intrigued ;) Thanks! – Impuls3H Apr 12 '17 at 08:11
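For anyone who wants actual numbers instead of counting in their head, here is a rough sketch of how the `timeit` comparison mentioned in the comments could look. The data is randomly generated to roughly match the 16000 rows in the question, and the function names `with_loop` and `with_comprehension` are made up for this example. It only times the interval lookup, not the assignment into the DataFrame:

```python
import bisect
import random
import timeit

# Hypothetical data sized like the question (16000 rows).
intervalsArray = [10, 20, 30]
colorsPalette = ["#ffffcc", "#a1dab4", "#41b6c4", "#225ea8"]
rng = random.Random(0)
column = [rng.randrange(0, 40) for _ in range(16000)]


def with_loop():
    # Explicit loop that appends one color per value.
    result = []
    for value in column:
        result.append(colorsPalette[bisect.bisect_left(intervalsArray, int(value))])
    return result


def with_comprehension():
    # Same lookup written as a list comprehension.
    return [colorsPalette[bisect.bisect_left(intervalsArray, int(v))] for v in column]


# Each function is called 20 times; timeit.timeit returns the total seconds.
loop_seconds = timeit.timeit(with_loop, number=20)
comprehension_seconds = timeit.timeit(with_comprehension, number=20)
print(f"loop: {loop_seconds:.4f}s, comprehension: {comprehension_seconds:.4f}s")
```

The timings will vary by machine and Python version, so treat them as relative rather than absolute.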
