Faster apply of a function to every row in pandas

Question

I have a column ("color_values") in my df with some numbers from 1 to 10 and I want to transform those numbers into hex colors with matplotlib.cm (cm) and matplotlib.colors (mcol).

Here I build my palette:

import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as mcol

color_list = ["#084594", ...] # my colors
cm1 = mcol.ListedColormap(color_list)
cnorm = mcol.Normalize(vmin=df["color_values"].min(), vmax=df["color_values"].max())
cpick = cm.ScalarMappable(norm=cnorm, cmap=cm1)
cpick.set_array(np.array([]))

And this is the part that needs to be faster because I have millions of rows:

df["color_hex"] = df.apply(
    lambda row: mcol.to_hex(cpick.to_rgba(row["color_values"])), axis=1
)

This adds another column (color_hex) holding the hex color for each value in color_values, but it does so by looping through every row.

I looked at numpy.vectorize, but its docs say: "The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop."

I also looked at numpy.where, but that seems better suited to cases where you have a condition to satisfy, which is not my case.

So which other numpy operations would be suitable for this?

The actual problem is solved in [this question](https://stackoverflow.com/questions/49156484/fast-way-to-map-scalars-to-colors-in-python). If the unnecessary use of matplotlib is nonetheless desired, check the `apply2` case from [this answer](https://stackoverflow.com/a/47398328/4124317), which uses `numpy.apply_along_axis`. — ImportanceOfBeingErnest, Mar 07 '18 at 22:36

score 5 · Answer 1 · answered Mar 07 '18 at 10:37

There are two approaches that may improve performance. Without your data, it is difficult to confirm whether they actually do.

1. Use pd.Series.apply instead of pd.DataFrame.apply

df['color_hex'] = df['color_values'].apply(lambda x: mcol.to_hex(cpick.to_rgba(x)))

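With axis=1, pd.DataFrame.apply builds a Series for every row before calling the function, whereas pd.Series.apply passes only the scalar values. This reduces the amount of structured data that needs to be passed through the loop.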

2. Use a list comprehension

df['color_hex'] = [mcol.to_hex(cpick.to_rgba(x)) for x in df['color_values']]

This works because a list with one element per row can be assigned directly as a new DataFrame column.
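Going beyond the two options above: the question says the column holds only a few distinct values (1 to 10), so another possibility is to convert each distinct value once and map the results back onto the rows. The following is a minimal sketch, not a measured result. The four-color palette and the sample frame are assumptions made for illustration, since the question's real color_list is elided.

```python
import numpy as np
import pandas as pd
import matplotlib.cm as cm
import matplotlib.colors as mcol

# Illustrative palette only; the question's full color_list is not shown.
color_list = ["#084594", "#4292c6", "#9ecae1", "#eff3ff"]

# Hypothetical sample data: the values 1..10 repeated 1,000 times.
df = pd.DataFrame({"color_values": np.tile(np.arange(1, 11), 1000)})

cm1 = mcol.ListedColormap(color_list)
cnorm = mcol.Normalize(vmin=df["color_values"].min(), vmax=df["color_values"].max())
cpick = cm.ScalarMappable(norm=cnorm, cmap=cm1)

# Convert each distinct value once...
lookup = {v: mcol.to_hex(cpick.to_rgba(v)) for v in df["color_values"].unique()}

# ...then look the result up for every row.
df["color_hex"] = df["color_values"].map(lookup)
```

The conversion function then runs once per distinct value rather than once per row, so any benefit depends on how many distinct values the column really has.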

jpp

Which is faster? – Gonzalo Garcia Jul 07 '20 at 03:24
@GonzaloGarcia, if your data is clean, probably the list comprehension. I recommend testing with your own data. – jpp Jul 07 '20 at 06:59
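One way to run such a test is with timeit. This is a rough sketch: the palette, the sample series and the helper function names are all hypothetical, so substitute your own data before drawing any conclusion.

```python
import timeit

import numpy as np
import pandas as pd
import matplotlib.cm as cm
import matplotlib.colors as mcol

# Illustrative palette and sample data (assumptions, not the asker's data).
color_list = ["#084594", "#4292c6", "#9ecae1", "#eff3ff"]
values = pd.Series(np.tile(np.arange(1, 11), 500), name="color_values")

cpick = cm.ScalarMappable(
    norm=mcol.Normalize(vmin=values.min(), vmax=values.max()),
    cmap=mcol.ListedColormap(color_list),
)

def with_series_apply():
    # Option 1 from the answer: pd.Series.apply
    return values.apply(lambda x: mcol.to_hex(cpick.to_rgba(x)))

def with_list_comprehension():
    # Option 2 from the answer: a plain list comprehension
    return [mcol.to_hex(cpick.to_rgba(x)) for x in values]

# Time a few repetitions of each; absolute numbers depend on machine and data.
t_apply = timeit.timeit(with_series_apply, number=3)
t_list = timeit.timeit(with_list_comprehension, number=3)

print(f"Series.apply:       {t_apply:.3f} s")
print(f"list comprehension: {t_list:.3f} s")
```

Both functions produce the same hex strings, so the comparison is only about speed.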
