Vectorized way for applying a function to a dataframe to create lists

Question

I have seen a few questions like these:

Vectorized alternative to iterrows, Faster alternative to iterrows, Pandas: Alternative to iterrow loops, for loop using iterrows in pandas, python: using .iterrows() to create columns, Iterrows performance. But it seems like each one is a unique case rather than a generalized approach.

My question is, again, about .iterrows.

I am trying to pass the first and second column of each row to a function and create a list out of the results.

What I have:

I have a pandas DataFrame with two columns that looks like this:

         I.D         Score
1         11          26
3         12          26
5         13          26
6         14          25

What I did:

Here, Points is a function I defined earlier:

my_points = [Points(int(row[0]),row[1]) for index, row in score.iterrows()]

What I am trying to do:

A faster, vectorized form of the above.

So you want to apply a function on values in a `DataFrame`, and return a list? Try `DataFrame.apply` - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html. — user3471881, Nov 29 '18 at 09:25
The way you wrote the sentence actually made me understand my question more. — PolarBear10, Nov 29 '18 at 09:27

user3471881 · Answer 1 · 2018-11-29T10:18:06.127

The question is actually not about how to iterate through a DataFrame and return a list, but rather about how to apply a function to the column values of a DataFrame, row by row.

You can use pandas.DataFrame.apply with axis set to 1:

df.apply(func, axis=1)
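As a minimal sketch of what `axis=1` means in practice: the function receives each row as a pandas Series indexed by column name. The DataFrame below mirrors the sample data in the question, and `describe_row` is a hypothetical function used only for illustration.

```python
import pandas as pd

# Sample data mirroring the question (index labels 1, 3, 5, 6).
df = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)


def describe_row(row):
    # With axis=1, `row` is a Series holding one row, indexed by column name.
    return f"{int(row['I.D'])}:{int(row['Score'])}"


# The result is a Series that keeps the original index labels.
labels = df.apply(describe_row, axis=1)
print(labels.tolist())
```

A function that expects two separate arguments cannot be passed directly this way, because it will be called with a single Series per row.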

To put the results in a list (this depends on what your function returns), you could do:

df.apply(Points, axis=1).tolist()

If you want to apply it to only some columns:

df[['Score', 'I.D']].apply(Points, axis=1)

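If the function takes multiple arguments, you can use numpy.vectorize. Note that NumPy documents it as a convenience that is essentially a for loop rather than true vectorization, although it can still be faster than a row-wise apply because no Series is built for each row: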

np.vectorize(Points)(df['I.D'], df['Score'])

Or a lambda:

df.apply(lambda x: Points(x['I.D'], x['Score']), axis=1).tolist()
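To make the options above concrete, here is a hedged, self-contained sketch on the question's sample data. The question never shows `Points`, so the `Point` dataclass, the body of `Points`, and the `Extra` column are all assumptions made for illustration. Passing `otypes=[object]` to `np.vectorize` is a choice made here so that one Python object per row comes back.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class Point:
    student_id: int
    score: int


def Points(student_id, score):
    # Hypothetical stand-in for the question's two-argument function.
    return Point(int(student_id), int(score))


# "Extra" is an assumed additional column that the function should ignore.
df = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25], "Extra": list("abcd")},
    index=[1, 3, 5, 6],
)

# Row-wise apply: the lambda picks out the two columns the function needs.
via_apply = df.apply(lambda row: Points(row["I.D"], row["Score"]), axis=1).tolist()

# np.vectorize: otypes=[object] keeps one Python object per row.
via_vectorize = np.vectorize(Points, otypes=[object])(df["I.D"], df["Score"]).tolist()

# Plain zip over the two columns, with no per-row pandas object.
via_zip = [Points(a, b) for a, b in zip(df["I.D"], df["Score"])]

print(via_zip[0])
```

All three produce the same list here; which one is fastest depends on the function and the data, so it is worth timing them on your own DataFrame.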

edited Nov 29 '18 at 10:18

answered Nov 29 '18 at 09:30

user3471881

this does not work as the function needs to take in 2 values and it needs to take it from some columns and not everything – PolarBear10 Nov 29 '18 at 09:56
It wasn't obvious from your question that your `DataFrame` contained more columns than `ID` and `Score` so I don't think that is a valid point. But, you can just `apply` by selecting the `columns` you want first. This can be used in a function that takes multiple values, but it depends on how the function is written - you didn't post it in your question. – user3471881 Nov 29 '18 at 10:01
My apologies, I wanted to thank you for actually rephrasing my question correctly. – PolarBear10 Nov 29 '18 at 10:05

jezrael · Accepted Answer · 2018-11-29T10:00:16.803

Try a list comprehension:

score = pd.concat([score] * 1000, ignore_index=True)

def Points(a,b):
    return (a,b)

In [147]: %timeit [Points(int(a),b) for a, b in zip(score['I.D'],score['Score'])]
1.3 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [148]: %timeit [Points(int(row[0]),row[1]) for index, row in score.iterrows()]
259 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [149]: %timeit [Points(int(row[1]),row[2]) for row in score.itertuples()]
3.64 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
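For readers outside IPython, the comparison above can be approximated with the standard-library `timeit` module. This is only a sketch: it rebuilds `score` from the sample data in the question, reuses the toy `Points` function from this answer, and the wrapper function names are invented for illustration. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd


def Points(a, b):
    # Same toy function as in the answer above.
    return (a, b)


# Sample data from the question, repeated 1000 times as in the answer.
score = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)
score = pd.concat([score] * 1000, ignore_index=True)


def with_zip():
    return [Points(int(a), b) for a, b in zip(score["I.D"], score["Score"])]


def with_itertuples():
    # index=False drops the index, so position 0 is I.D and position 1 is Score.
    return [Points(int(row[0]), row[1]) for row in score.itertuples(index=False)]


def with_iterrows():
    # Label-based access avoids relying on positional lookups in a Series.
    return [Points(int(row["I.D"]), row["Score"]) for _, row in score.iterrows()]


# All three approaches should build the same list.
assert with_zip() == with_itertuples() == with_iterrows()

for fn in (with_zip, with_itertuples, with_iterrows):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per run")
```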

edited Nov 29 '18 at 10:00

answered Nov 29 '18 at 09:30

jezrael

This reduced my processing time from 21 minutes to 31 seconds. Thank you. – PolarBear10 Nov 29 '18 at 09:55
@Matthew - yes, try `apply`, but in my opinion it should be slower because of some security checks. – jezrael Nov 29 '18 at 09:57
it seems like itertuples is also very close to the performance of this one, in my case. – PolarBear10 Nov 29 '18 at 09:58
.apply in this is not applicable for my case – PolarBear10 Nov 29 '18 at 09:58
@Matthew - yes, only a bit slower; I added it to the timings in my answer. – jezrael Nov 29 '18 at 10:00

score 1 · Answer 3 · answered Nov 29 '18 at 09:45

Have you ever tried the method .itertuples()?

my_points = [Points(int(row[1]),row[2]) for row in score.itertuples()]

It is a faster way to iterate over a pandas DataFrame.

I hope it helps.
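One detail worth checking when indexing the tuples by position: by default `itertuples()` places the index label at position 0, so the column values start at position 1. The sketch below, built on the question's sample data, shows both layouts; `name=None` is used only so that plain tuples are returned and are easy to compare.

```python
import pandas as pd

score = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)

# Default: each tuple starts with the index label, then the column values.
with_index = list(score.itertuples(name=None))

# index=False: tuples contain only the column values, in column order.
without_index = list(score.itertuples(index=False, name=None))

print(with_index[0])
print(without_index[0])
```

With `index=False`, `row[0]` and `row[1]` line up with the I.D and Score columns, matching the positions used in the `iterrows` version from the question.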

answered Nov 29 '18 at 09:45

Alejandro

This one also fits very well! Thank you – PolarBear10 Nov 29 '18 at 09:57
The jezrael answer seems to be the fastest @alejandro but thank you for your time! – PolarBear10 Nov 29 '18 at 10:06
