Pandas function: DataFrame.apply() runs top row twice

Question

I have two versions of a function that uses Pandas for Python 2.7 to go through inputs.csv, row by row.

The first version uses Series.apply() on a single column, and goes through each row as intended.

The second version uses DataFrame.apply() on multiple columns, and for some reason it reads the top row twice. It then goes on to execute the rest of the rows without duplicates.

Any ideas why the latter reads the top row twice?

Version #1 – Series.apply() (Reads top row once)

import pandas as pd
df = pd.read_csv("inputs.csv", delimiter=",")

def v1(x):
    y = x
    return pd.Series(y)
df["Y"] = df["X"].apply(v1)

Version #2 – DataFrame.apply() (Reads top row twice)

import pandas as pd
df = pd.read_csv("inputs.csv", delimiter=",")

def v2(f):
    y = f["X"]
    return pd.Series(y)
df["Y"] = df[(["X", "Z"])].apply(v2, axis=1)

print y:

v1(x):            v2(f):

    Row_1         Row_1
    Row_2         Row_1
    Row_3         Row_2
                  Row_3

What is `y = f["X"]`? is this a typo? also you need to post raw input data or code to produce a df that reproduces your output — EdChum, Aug 07 '15 at 12:40
@EdChum Thanks for your reply. `y = f["X"]` is supposed to make `y` equal to the current cell in column `"X"`. — P A N, Aug 07 '15 at 12:41
Sorry I knocked up some dummy data and I cannot reproduce this, you'll have to post code that reproduces your output — EdChum, Aug 07 '15 at 12:46
This is explained in the notes of the docstring: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html — joris, Aug 07 '15 at 12:59
@joris: Thanks, that is probably it. Although when I tried with this test df, I could not reproduce the error: `df = pd.DataFrame({'X': ['X0', 'X1', 'X2', 'X3'], 'Z': ['Z0', 'Z1', 'Z2', 'Z3']})`. Something in my original csv that causes the `func` to "side-effect". Is there any work-around to make it skip doing the first row twice? — P A N, Aug 07 '15 at 13:06
[This has been fixed in pandas 1.1, please upgrade.](https://stackoverflow.com/a/62893120/4909087) — cs95, Jul 14 '20 at 19:12

score 19 · Accepted Answer · answered Mar 07 '16 at 18:59

This is by design, as described here and here

The apply function needs to know the shape of the returned data to intelligently figure out how it will be combined. Apply is a shortcut that intelligently applies aggregate, transform or filter. You can try breaking apart your function like so to avoid the duplicate calls.
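One way to read "breaking apart" is sketched below. It assumes the goal is to keep the function passed to `apply` free of side effects and to perform any side-effecting step in a separate pass, so that an extra call on the first row would change nothing. The names `compute`, `record`, and `log` are hypothetical and not taken from the question.

```python
import pandas as pd

df = pd.DataFrame({"X": ["X0", "X1", "X2"], "Z": ["Z0", "Z1", "Z2"]})

def compute(row):
    # Pure step: builds a new scalar from the row and mutates nothing,
    # so being called more than once for a row is harmless.
    return row["X"] + "-" + row["Z"]

log = []

def record(value):
    # Side-effecting step, kept out of apply() on purpose.
    log.append(value)

# Pass 1: only the pure function goes through apply().
df["Y"] = df[["X", "Z"]].apply(compute, axis=1)

# Pass 2: side effects run in an explicit loop, exactly once per row.
for value in df["Y"]:
    record(value)
```

With this split, `log` ends up with one entry per row regardless of how many times `apply` chooses to call `compute`.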

AZhao

How do you break the function apart? Can you please help me with an example? – Nipun Feb 02 '20 at 10:44
for me it has nothing to do with "intelligently"... took me a while to identify this awesome design feature... – mojovski Aug 20 '20 at 20:32

score 1 · Answer 2 · answered Jun 17 '20 at 09:54

I sincerely don't see any explanation on this in the provided links, but anyway: I stumbled upon the same in my code, and did the silliest thing, i.e. short-circuit the first call. But it worked.

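is_first_call = True

def refill_uniform(row, st=600):
    # `global` because the flag is defined at module level; `nonlocal` is
    # only valid when this function is nested inside another function.
    global is_first_call
    if is_first_call:
        # Short-circuit the extra first call and return the row unchanged.
        is_first_call = False
        return row

    # ... here goes the code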
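The same short-circuit can be packaged as a wrapper so that the flag is not a module-level variable. This is only a sketch: `skip_first_call` and `shout` are hypothetical names, and plain strings stand in for rows. Note that the trick only makes sense on pandas versions that really do call the function twice for the first row; per the comment on the question, that behaviour was fixed in pandas 1.1, where a guard like this would skip a genuine row instead.

```python
def skip_first_call(func):
    """Wrap func so that its first invocation returns the input untouched."""
    is_first_call = True

    def wrapper(row, *args, **kwargs):
        # nonlocal is valid here: the flag lives in the enclosing function.
        nonlocal is_first_call
        if is_first_call:
            is_first_call = False
            return row
        return func(row, *args, **kwargs)

    return wrapper


def shout(row):
    # Stand-in for the real per-row work.
    return row.upper()


guarded = skip_first_call(shout)

# The first call is swallowed; later calls reach the wrapped function.
results = [guarded("a"), guarded("a"), guarded("b")]
```

Each call to `skip_first_call` creates a fresh flag, so a new wrapper is needed for every `apply` run.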

score 1 · Answer 3 · answered Jul 02 '20 at 20:06

I faced the same issue today and spent a few hours searching Google for a solution. Finally I came up with a workaround like this:

import numpy as np
import pandas as pd
import time

def foo(text):
    text = str(text) + ' is processed'
    return text


def func1(data):
    print("run1")
    return foo(data['text'])


def func2(data):
    print("run2")
    data['text'] = data['text'] + ' is processed'
    return data


def test_one():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 3))
    data['text'] = 'text'

    start = time.time()
    data = data.apply(func1, axis = 1)
    print(time.time() - start)

    print(data)


def test_two():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 3))
    data['text'] = 'text'

    start = time.time()
    data = data.apply(func2, axis=1)
    print(time.time() - start)
    print(data)


test_one()
test_two()

If you run the program, you will see output like this:

run1
run1
run1
0.0029706954956054688
0    text is processed
1    text is processed
2    text is processed
dtype: object
run2
run2
run2
run2
0.0049877166748046875
                             text
0  text is processed is processed
1               text is processed
2               text is processed

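With func2, which modifies the row in place and returns it, "run2" is printed four times for three rows and the first row is processed twice. Splitting func2 into func1 and foo, so that the applied function returns a new value instead of changing the row, makes the first row run only once.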
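To check how a particular pandas installation behaves, a small probe can record the index label of every row it receives. This is a sketch with hypothetical names (`probe`, `seen`); the probe returns the row itself, as func2 does, because that is the case in which the extra call was observed above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"text": ["text"] * 3}, index=np.arange(0, 3))

seen = []

def probe(row):
    # Record which row was passed in, then hand the row back unchanged.
    seen.append(row.name)
    return row

result = data.apply(probe, axis=1)

# One entry per call; a repeated label means that row was evaluated twice.
print(seen)
```

If the first label appears twice in `seen`, the installed version still makes the extra call and the applied function should avoid side effects.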
