
I'm new to python and pandas, and I don't understand why sometimes I can copy or manipulate dataframe columns and sometimes I can't. Here is a dataframe called df:

         A      B          C
0    hello   hola    bonjour
1  goodbye  adios  au revoir

My first modification will be overwriting column 'C' row by row with a for-loop (yes, there may be better ways to do this - that's not the point), and the result is as I expected:

for index,row in df.iterrows():
    row['C'] = row['A']

         A      B        C
0    hello   hola    hello
1  goodbye  adios  goodbye

I can add a new column using the following for-loop, and again, I get what I expected:

for index,row in df.iterrows():
    df.loc[index,'D'] = len(row['C'])

         A      B        C  D
0    hello   hola    hello  5
1  goodbye  adios  goodbye  7

Now I try almost exactly the same thing as my first modification (overwriting column 'B' row by row with a for-loop), but this time it doesn't work. The dataframe does not change this time.

for index,row in df.iterrows():
    row['B'] = row['A']

         A      B        C  D
0    hello   hola    hello  5
1  goodbye  adios  goodbye  7

I want to know 2 things:

1) Why does the same code overwrite a column sometimes, but not other times?

2) Am I doing something wrong that causes pandas to behave in an unintuitive manner like this? If so, what is the proper way to construct one column from another so that this kind of thing doesn't happen?

Any good answer or advice is much appreciated. Thanks!

Adam K

2 Answers


First, you should never try to modify your dataframe while iterating over it.

To construct a new column, you can just do:

df['C'] = df['A']

or specifically for getting the length of each string (see docs):

df['D'] = df['C'].str.len()

The reason for the different outputs is that you get either a view or a copy of the original data, depending on the circumstances (whether the dtypes are homogeneous or not).
In your case, the first time, all columns hold strings, so you get a view of the original data and modifications are reflected in the original dataframe. But after adding column D, the columns have different dtypes and you get a copy when iterating. For this reason, the changes are not reflected in the original dataframe in your last case (see also this issue).
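As a rough illustration of the mixed-dtype case, here is a minimal sketch. The frame name `mixed` is made up for this example, and only the copy behaviour is shown, because whether a single-dtype frame hands back a view can depend on the pandas version.

```python
import pandas as pd

# Hypothetical frame mirroring the question after column D was added:
# two string columns plus one integer column, so the dtypes are mixed.
mixed = pd.DataFrame({
    "A": ["hello", "goodbye"],
    "B": ["hola", "adios"],
    "D": [5, 7],
})

for index, row in mixed.iterrows():
    # `row` is a separate Series built for this iteration, so this
    # assignment changes only that temporary object, not `mixed`.
    row["B"] = row["A"]

# Column B still holds its original values.
print(mixed["B"].tolist())  # ['hola', 'adios']
```

Because the outcome of writing to `row` depends on details like this, it is safer to treat the rows yielded by `iterrows` as read-only.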

joris

The reason the code generates different answers has to do with iterrows providing a view vs. a copy of the data. If you assign to a view, it modifies the original data, while assigning to a copy leaves the original unchanged.

From what I understand (see this answer), iterrows will generate a view only for a single-dtyped object, which is why the assignment works when all columns are strings, but fails once you add an integer column.

In terms of how you make a new column based on other columns - if you absolutely need to iterate, then you can assign using loc, as you did in one example. But you should always look for a vectorized solution, then look at apply, and only then think about iterating. See this answer for some more background.
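The order of preference described above could look roughly like this. It is only a sketch: the frame is rebuilt from the question's data, and column E with its upper-casing step is a made-up example of something one might reach for apply to do.

```python
import pandas as pd

# Frame rebuilt from the question's data.
df = pd.DataFrame({
    "A": ["hello", "goodbye"],
    "B": ["hola", "adios"],
    "C": ["bonjour", "au revoir"],
})

# 1. Vectorized: whole-column operations, no Python-level loop.
df["C"] = df["A"]
df["D"] = df["C"].str.len()

# 2. apply: an element-wise function (hypothetical example column E).
df["E"] = df["A"].apply(lambda s: s.upper())

# 3. Last resort: iterate, but write through df.loc rather than
#    assigning to `row`, so the change always lands in df itself.
for index, row in df.iterrows():
    df.loc[index, "B"] = row["A"]

print(df)
```

Writing through `df.loc[index, column]` addresses the dataframe directly, so the result does not depend on whether `row` happens to be a view or a copy.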

chrisb