Optimizing a Pandas DataFrame - Moving from iterrows to NumPy Series or apply

Question

We have some Python code that uses Pandas. We iterate through each row, take a value from a column, and pass it as a parameter to a function call. We currently use the iterrows() method, but I am looking to optimize it.

Here is my existing code:

df_input = pd.read_csv("input.csv")
df = pd.DataFrame()

for index, row in df_input.iterrows():
    variable1 = do_something_1(parameter1, parameter2, row["body"])
    listOfSeries = [pd.Series([row["id"], row["body"], variable1], index=['id', 'body', 'column1'] )]
    df = df.append(listOfSeries, ignore_index=True)

I am trying to improve the performance of this code. I have read the thread "Does pandas iterrows have performance issues?".

I think I can use the apply method to call do_something_1 on the entire dataframe, but how do I save the results of do_something_1 to a new column in the same dataframe?

score 0 · Answer 1 · answered Dec 01 '21 at 20:37

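There are several issues with this code. Both iterrows and .append are very inefficient for this task: iterrows builds a new Series for every row, and .append copies the entire DataFrame on every call, so the cost of the loop grows roughly quadratically with the number of rows. (DataFrame.append was also deprecated in pandas 1.4 and removed in 2.0.) You could use apply as you mentioned: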

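import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'body': ['a', 'b', 'c']})
def do_sth(v, p):
    # v is the value from the 'body' column; p is the extra argument from args
    return p + v

# apply passes each column value first, followed by the items of args
df['column1'] = df['body'].apply(do_sth, args=('The body is: ',))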

   id body         column1
0   1    a  The body is: a
1   2    b  The body is: b
2   3    c  The body is: c

Note that do_sth receives the value from the DataFrame as its first argument, whereas do_something_1 in your example takes it as the last one. You may need a wrapper function to reorder the arguments.
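As a rough sketch of what such a wrapper could look like: the question does not show do_something_1, parameter1, or parameter2, so the definitions below are hypothetical stand-ins chosen only to make the example runnable. It shows two ways to get the column value into the last argument position, a lambda and functools.partial.

```python
from functools import partial

import pandas as pd


# Hypothetical stand-in: the real do_something_1 is not shown in the question.
def do_something_1(parameter1, parameter2, body):
    return f"{parameter1}{parameter2}{body}"


# Placeholder values, assumed for illustration only.
parameter1, parameter2 = "The body ", "is: "

df = pd.DataFrame({"id": [1, 2, 3], "body": ["a", "b", "c"]})

# Option 1: a lambda that places each column value in the last position.
df["column1"] = df["body"].apply(
    lambda body: do_something_1(parameter1, parameter2, body)
)

# Option 2: partial pre-binds the two leading arguments, so the value
# supplied by apply lands in the remaining (third) position.
df["column1_partial"] = df["body"].apply(
    partial(do_something_1, parameter1, parameter2)
)

print(df)
```

Both options assign the result directly to a new column of the same dataframe, so no loop or .append is needed. Keep in mind that apply still calls the function once per row; it removes the overhead of iterrows and repeated copying, but it is not vectorized.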
