
I have two versions of a function that uses Pandas for Python 2.7 to go through inputs.csv, row by row.

The first version uses Series.apply() on a single column, and goes through each row as intended.

The second version uses DataFrame.apply() on multiple columns, and for some reason it reads the top row twice. It then goes on to execute the rest of the rows without duplicates.

Any ideas why the latter reads the top row twice?


Version #1 – Series.apply() (Reads top row once)

import pandas as pd
df = pd.read_csv("inputs.csv", delimiter=",")

def v1(x):
    y = x
    return pd.Series(y)
df["Y"] = df["X"].apply(v1)

Version #2 – DataFrame.apply() (Reads top row twice)

import pandas as pd
df = pd.read_csv("inputs.csv", delimiter=",")

def v2(f):
    y = f["X"]
    return pd.Series(y)
df["Y"] = df[["X", "Z"]].apply(v2, axis=1)

print y:

v1(x):            v2(f):

    Row_1         Row_1
    Row_2         Row_1
    Row_3         Row_2
                  Row_3
P A N
  • What is `y = f["X"]`? is this a typo? also you need to post raw input data or code to produce a df that reproduces your output – EdChum Aug 07 '15 at 12:40
  • @EdChum Thanks for your reply. `y = f["X"]` is supposed to make `y` equal to the current cell in column `"X"`. – P A N Aug 07 '15 at 12:41
  • Sorry I knocked up some dummy data and I cannot reproduce this, you'll have to post code that reproduces your output – EdChum Aug 07 '15 at 12:46
  • This is explained in the notes of the docstring: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html – joris Aug 07 '15 at 12:59
  • @joris: Thanks, that is probably it. Although when I tried with this test df, I could not reproduce the error: `df = pd.DataFrame({'X': ['X0', 'X1', 'X2', 'X3'], 'Z': ['Z0', 'Z1', 'Z2', 'Z3']})`. Something in my original csv must cause the `func` to "side-effect". Is there any work-around to make it skip doing the first row twice? – P A N Aug 07 '15 at 13:06
  • [This has been fixed in pandas 1.1, please upgrade.](https://stackoverflow.com/a/62893120/4909087) – cs95 Jul 14 '20 at 19:12

3 Answers


This is by design, as described here and here

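The apply function needs to know the shape of the returned data to figure out how the results will be combined, so in the affected pandas versions it calls the function on the first row (or column) one extra time to inspect what comes back. Apply is a shortcut that chooses between aggregate, transform or filter behaviour. You can try breaking apart your function so that the extra call has no visible side effects.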
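One way to check whether a given pandas installation makes the extra call is to record every row the function receives. The following is only a minimal sketch: the test frame reuses the column names from the question, and `record_and_return` and `calls` are hypothetical names introduced for illustration. The length of `calls` depends on the pandas version, so no expected value is shown for it.

```python
import pandas as pd

# Small stand-in for inputs.csv (assumed data, not the asker's file).
df = pd.DataFrame({"X": ["X0", "X1", "X2"], "Z": ["Z0", "Z1", "Z2"]})

calls = []  # index label of every row the function is called with


def record_and_return(row):
    # Side effect: remember which row we were given.
    calls.append(row.name)
    return row["X"]


result = df.apply(record_and_return, axis=1)

print(calls)            # a repeated first label means the first row was visited twice
print(result.tolist())  # the returned values are the same either way
```

If the first index label appears twice in `calls`, the installed version still evaluates the first row an extra time.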

AZhao

I sincerely don't see any explanation on this in the provided links, but anyway: I stumbled upon the same in my code, and did the silliest thing, i.e. short-circuit the first call. But it worked.

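is_first_call = True

def refill_uniform(row, st=600):
    # `global` assumes the flag is defined at module level;
    # use `nonlocal` instead if both live inside another function.
    global is_first_call
    if is_first_call:
        is_first_call = False
        return row

    # ... here goes the code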
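An alternative that does not depend on how many times apply calls the function is to loop over the rows explicitly. This is only a sketch under assumptions: the frame mimics the `X` and `Z` columns from the question, and `process` and `seen` are hypothetical names standing in for the real per-row work and its side effect.

```python
import pandas as pd

# Stand-in data using the question's column names (assumed values).
df = pd.DataFrame({"X": ["X0", "X1", "X2"], "Z": ["Z0", "Z1", "Z2"]})

seen = []  # side effect we want to happen exactly once per row


def process(x, z):
    seen.append(x)
    return x + "-" + z


# zip iterates the two columns in lockstep, so process runs once per row.
df["Y"] = [process(x, z) for x, z in zip(df["X"], df["Z"])]

print(seen)
print(df["Y"].tolist())
```

Because the loop is written out, no flag is needed to skip a first call, at the cost of passing the columns explicitly.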

Oleg O

I faced the same issue today and spent a few hours searching Google for a solution. Finally I came up with a workaround like this:

import numpy as np
import pandas as pd
import time

def foo(text):
    text = str(text) + ' is processed'
    return text


def func1(data):
    print("run1")
    return foo(data['text'])


def func2(data):
    print("run2")
    data['text'] = data['text'] + ' is processed'
    return data


def test_one():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 3))
    data['text'] = 'text'

    start = time.time()
    data = data.apply(func1, axis=1)
    print(time.time() - start)

    print(data)


def test_two():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 3))
    data['text'] = 'text'

    start = time.time()
    data = data.apply(func2, axis=1)
    print(time.time() - start)
    print(data)


test_one()
test_two()

If you run the program you will see a result like this:

run1
run1
run1
0.0029706954956054688
0    text is processed
1    text is processed
2    text is processed
dtype: object
run2
run2
run2
run2
0.0049877166748046875
                             text
0  text is processed is processed
1               text is processed
2               text is processed

By splitting the function (func2) into func1 and foo, it runs the first row once only.
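The doubled "is processed" in the second output appears to come from func2 modifying the row it receives in place, so a repeated call sees the already-modified row. A possible variation, sketched here under that assumption, is to work on a copy of the row; `mark_processed` is a hypothetical name, and the frame mirrors the one built in test_two.

```python
import pandas as pd

# Same shape of data as in test_two above.
data = pd.DataFrame({"text": ["text"] * 3})


def mark_processed(row):
    # Copy first so the row passed in by apply is never modified in place;
    # a repeated call on the same row then starts from the original value.
    row = row.copy()
    row["text"] = row["text"] + " is processed"
    return row


result = data.apply(mark_processed, axis=1)
print(result)
```

This keeps the DataFrame-returning form of func2 while making any extra call harmless to the values, although print statements or other external side effects inside the function would still run once per call.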

P C