I have seen a few questions like these:

Vectorized alternative to iterrows, Faster alternative to iterrows, Pandas: Alternative to iterrow loops, for loop using iterrows in pandas, python: using .iterrows() to create columns, Iterrows performance. But each of them seems to address a unique case rather than a generalized approach.

My question is also about .iterrows().

I am trying to pass the first and second column of each row to a function and build a list from the results.

What I have:

I have a pandas DataFrame with two columns that look like this.

         I.D         Score
1         11          26
3         12          26
5         13          26
6         14          25

What I did:

Here Points is a function I defined earlier.

my_points = [Points(int(row[0]),row[1]) for index, row in score.iterrows()]

What I am trying to do:

Find a faster, vectorized form of the above.

PolarBear10

3 Answers

The question is actually not about how to iterate through a DataFrame and return a list, but rather about how to apply a function to the column values of each row in a DataFrame.

You can use pandas.DataFrame.apply with axis set to 1:

df.apply(func, axis=1)

To get a list (how this looks depends on what your function returns), you could use:

df.apply(Points, axis=1).tolist()

If you want to apply it to only some columns, select them first:

df[['I.D', 'Score']].apply(Points, axis=1)

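If your function takes multiple arguments, you can use numpy.vectorize. It is a convenience wrapper that still loops in Python rather than true vectorization, but it avoids building a Series for every row the way apply with axis=1 does:

np.vectorize(Points)(df['I.D'], df['Score'])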

Or a lambda:

df.apply(lambda x: Points(x['I.D'], x['Score']), axis=1).tolist()
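
As a rough, self-contained sketch of the options above: the real Points function was never posted, so the version below is a hypothetical stand-in that returns a tuple, and the DataFrame is rebuilt from the sample in the question. Note that when the function returns a tuple, numpy.vectorize hands back one array per tuple element, so the results have to be zipped back together.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: the asker's real Points function was not posted.
def Points(id_, score):
    return (id_, score)

# Sample data copied from the question.
df = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)

# Row-wise apply: each row arrives as a Series, so select columns by label.
via_apply = df.apply(
    lambda row: Points(int(row["I.D"]), row["Score"]), axis=1
).tolist()

# Plain Python loop over two columns at once; no per-row Series is built.
via_zip = [Points(int(a), b) for a, b in zip(df["I.D"], df["Score"])]

# numpy.vectorize: a tuple-returning function yields a tuple of arrays,
# one per output, so zip them to recover one tuple per row.
id_arr, score_arr = np.vectorize(Points)(df["I.D"].to_numpy(), df["Score"].to_numpy())
via_vectorize = list(zip(id_arr, score_arr))

print(via_apply)
```

All three produce one (I.D, Score) pair per row; which is fastest depends on the function and the data size, so it is worth timing on the real DataFrame.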
user3471881
  • this does not work as the function needs to take in 2 values and it needs to take it from some columns and not everything – PolarBear10 Nov 29 '18 at 09:56
  • It wasn't obvious from your question that your `DataFrame` contained more columns than `ID` and `Score` so I don't think that is a valid point. But, you can just `apply` by selecting the `columns` you want first. This can be used in a function that takes multiple values, but it depends on how the function is written - you didn't post it in your question. – user3471881 Nov 29 '18 at 10:01
  • My apologies, I wanted to thank you for actually rephrasing my question correctly. – PolarBear10 Nov 29 '18 at 10:05

Try a list comprehension over zip:

score = pd.concat([score] * 1000, ignore_index=True)

def Points(a,b):
    return (a,b)

In [147]: %timeit [Points(int(a),b) for a, b in zip(score['I.D'],score['Score'])]
1.3 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [148]: %timeit [Points(int(row[0]),row[1]) for index, row in score.iterrows()]
259 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [149]: %timeit [Points(int(row[1]),row[2]) for row in score.itertuples()]
3.64 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
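
To reproduce this comparison outside IPython, something like the following sketch with the standard timeit module should work. The DataFrame is rebuilt from the question's sample, the helper names with_zip, with_itertuples and with_iterrows are made up for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

def Points(a, b):
    return (a, b)

# Sample data from the question, repeated 1000 times as in the timings above.
score = pd.DataFrame({"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]})
score = pd.concat([score] * 1000, ignore_index=True)

def with_zip():
    return [Points(int(a), b) for a, b in zip(score["I.D"], score["Score"])]

def with_itertuples():
    # index=False drops the index, so position 0 is I.D and position 1 is Score.
    return [Points(int(row[0]), row[1]) for row in score.itertuples(index=False)]

def with_iterrows():
    # Label-based access; each row is a freshly built Series, hence the cost.
    return [Points(int(row["I.D"]), row["Score"]) for _, row in score.iterrows()]

for fn in (with_zip, with_itertuples, with_iterrows):
    # Best of three single runs, reported in milliseconds.
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {best * 1000:.2f} ms")
```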
jezrael

Have you tried the .itertuples() method?

my_points = [Points(int(row[0]),row[1]) for row in score.itertuples(index=False)]

It is a faster way to iterate over a pandas DataFrame than .iterrows().
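
One detail to keep in mind with .itertuples(): unless index=False is passed, the index is the first field of every tuple, so the columns start at position 1. A minimal sketch using the sample data from the question:

```python
import pandas as pd

# Sample data copied from the question, including its non-default index.
score = pd.DataFrame(
    {"I.D": [11, 12, 13, 14], "Score": [26, 26, 26, 25]},
    index=[1, 3, 5, 6],
)

# Default: the index comes first, followed by the column values.
first_default = tuple(next(score.itertuples()))

# index=False drops the index; name=None yields plain tuples.
first_no_index = next(score.itertuples(index=False, name=None))

print(first_default)
print(first_no_index)
```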

I hope it helps.

Alejandro