Loops over lists of Pandas objects exhibit weird behavior

Question

I have encountered a slight head scratcher when it comes to lists of Pandas objects and their loops. In some code I was working on, there were a few pandas dataframes which were placed into a list, so operations could be performed on all of them.

However, I noticed that certain operations, such as creating new columns, work in "naive" Python for loops, whereas other operations, like reversing the row order of the dataframes,

1. require explicit indexing, and
2. do not affect the original dataframes (only their copies residing within the list).

I am seeking help in getting the second part of my MWE below working as easily as the first part, and also to gain insight into understanding what underlying logic is causing this discrepancy in the first place.

## Creating data
import pandas as pd
from io import StringIO

data = StringIO(
"""
date;time;random
2019-06-12;19:59:59+00:00;99
2019-06-12;19:59:54+00:00;200
2019-06-12;19:59:52+00:00;65
2019-06-12;19:59:34+00:00;140
"""
)

df = pd.read_csv(data, sep=";")

print(df)

## Creating list; there is only one dataframe in this list to make the
## code easier to work with, but in actuality I am working with >20 dataframes
df_list = [df]

## First operation - successfully adds new column to both original df and df_list[0]
for dataframe in df_list:
    dataframe['date_time'] = pd.to_datetime(dataframe['date']+' '+dataframe['time'], utc=True)
print(df)
print(df_list[0])

## Second operation - successful only if using explicit indexing over the list; the first (commented) segment does nothing.
## The second segment works, but does not affect the original df, only df_list[0].

# for dataframe in df_list:
#     dataframe = dataframe.iloc[::-1]
#     dataframe.reset_index(drop=True, inplace=True)

for i in range(len(df_list)):
    df_list[i] = df_list[i].iloc[::-1]
    df_list[i].reset_index(drop=True, inplace=True)

print(df)
print(df_list[0])

`for ... in` loops do not work by reference. Use `enumerate` and then reference `df_list[i]`. — Vishnudev Krishnadas, Nov 13 '19 at 04:18
What is the point of `df_list`, what are you trying to do here? When working with Pandas it is best to avoid explicit loops as much as possible, so this raises some questions. Where does the first/original DataFrame come from? — AMC, Nov 13 '19 at 04:19
@AlexanderCécile, I think the reason for using a list is that there might be not just one dataframe, but several. — Bill Chen, Nov 13 '19 at 04:24
@BillChen That’s certainly possible, I’m just confused by `df_list = [df]`. — AMC, Nov 13 '19 at 04:26
@BillChen Ah you might be right actually, I just noticed he says in the post that there are multiple DataFrames in a list. — AMC, Nov 13 '19 at 04:33
Apologies, I should have made it clearer that I have multiple dataframes, but only created a list of one dataframe for ease-of-example. Edited question as such. — Coolio2654, Nov 13 '19 at 04:39
Hi Coolio2654, thanks for your recent note on your (now) deleted post. I'm aware that we don't have full consensus on succinct/technical writing, though it still reflects moderation/editing policy for now (references: [here](https://meta.stackoverflow.com/q/260776), [here](https://meta.stackoverflow.com/q/266525), [here](https://meta.stackexchange.com/q/2950), [here](https://meta.stackoverflow.com/q/361434)). — halfer, Sep 19 '20 at 21:25
There was a hint from Stack Overflow Inc in a blog post that, as part of a welcoming initiative, the guidelines/ethos on technical writing could be relaxed in favour of a conversational/forum style, but there was such a backlash from the community (largely for other reasons) that such a change would now be most unlikely to come from the company themselves. Of course anyone is free to suggest it on Meta, but my view at present is that the wider community aren't likely to embrace it. — halfer, Sep 19 '20 at 21:26
(If the majority view on this were to change, then I would respect that shift, but paradoxically at that point there may be less need for editors, since if "anything goes" then posts might as well be left as the author wrote them). — halfer, Sep 19 '20 at 21:30

score 2 · Answer 1 · answered Nov 13 '19 at 04:50

The first operation, `dataframe['date_time'] = ...`, is an in-place operation on the existing dataframe object. It is not an assignment to the name `dataframe`.

In the second operation, the second approach works because of how loop variables behave. When you loop through a list without using the index, each iteration binds a new variable to the current element. Assigning a new value to that variable only changes what the variable refers to; the list itself is untouched.

a = [1,2,3]
for i in a:
    i = 0
print(a)
print(i)

The output is:

[1, 2, 3]
0

So in your case, `for dataframe in df_list:` creates a variable `dataframe` that refers to each element of `df_list` in turn. When you then assign the reversed data frame to it, `dataframe` refers to a new object, while `df_list` and `df` still refer to the original one.

The problem here is that you (or we) confused an in-place operation with an assignment.
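To make the contrast concrete, here is a minimal sketch in plain Python. The nested lists are only a stand-in for dataframes (an assumption for illustration): like a dataframe, each inner list is a mutable object stored in an outer list.

```python
# Each inner list stands in for a DataFrame held in df_list.
rows = [[1, 2, 3], [4, 5]]

# In-place operation: `item` and the list element are the same object,
# so mutating that object is visible through `rows`.
for item in rows:
    item.append(0)
after_mutation = [list(item) for item in rows]

# Assignment: `item = ...` only points the name `item` at a new object;
# `rows` still holds the old objects.
for item in rows:
    item = item[::-1]
after_rebinding = [list(item) for item in rows]

print(after_mutation)   # [[1, 2, 3, 0], [4, 5, 0]]
print(after_rebinding)  # [[1, 2, 3, 0], [4, 5, 0]]
```

The first loop changes `rows`; the second leaves it exactly as it was, which is the same pattern as the commented-out dataframe loop in the question.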

Bill Chen

I still don't understand, however. Both `dataframe['date_time'] = pd.to_datetime()` and `dataframe = dataframe.iloc[::-1]` look like they are setting something. In the first case, it is setting an aspect of a dataframe, a new column, equal to something, and in the second case, it is setting the dataframe equal to a reversed version of itself. Both situations look perfectly equivalent to me, so I still do not understand where I am wrong. – Coolio2654 Nov 13 '19 at 06:57
When you use `dataframe['date_time'] = ` you are changing the `dataframe` itself. This is because there is an underlying operation, the column selection `['date_time']`. This operation is provided by pandas and works in place. But if you only use Python's native `=`, it just makes the variable refer to a new object. I wish I could draw it for you, but try playing around with simple variables; it took me a while to realize it. There might be other answers coming that explain it better than I can. :-) – Bill Chen Nov 13 '19 at 16:51
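The difference between the two kinds of `=` can be sketched with a small class. `Recorder` is a hypothetical stand-in for a dataframe, not part of pandas; it only logs calls to `__setitem__`, the method Python invokes for `obj[key] = value`.

```python
class Recorder:
    """Hypothetical stand-in for a DataFrame that logs item assignment."""

    def __init__(self):
        self.columns = {}
        self.calls = []

    def __setitem__(self, key, value):
        # Python calls this method for `obj[key] = value`.
        self.calls.append(key)
        self.columns[key] = value


original = Recorder()
alias = original                 # two names, one object

alias['date_time'] = [1, 2, 3]   # method call on the shared object
print(original.calls)            # ['date_time']

alias = Recorder()               # plain `=`: rebinds the name, calls no method
print(alias is original)         # False
print(original.calls)            # ['date_time']
```

So `dataframe['date_time'] = ...` is a method call on an existing object, whereas `dataframe = ...` never touches the object the name used to refer to.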

score 1 · Answer 2 · answered Jan 28 '20 at 15:22

The main point in your question is the premise that every operation on the shallow-copied element (`df_list[0]`) will be reflected in the original mutable instance (`df`). That does not hold for assignment, as explained in "Python: Assignment vs Shallow Copy vs Deep Copy".

Let's see this normal example:

In [29]: df_list = [df]

In [30]: df_list[0]['date_time'] = pd.to_datetime(df_list[0]['date']+' '+df_list[0]['time'], utc=True)

In [31]: df_list
Out[31]:
[         date            time  random                 date_time
 0  2019-06-12  19:59:59+00:00      99 2019-06-12 19:59:59+00:00
 1  2019-06-12  19:59:54+00:00     200 2019-06-12 19:59:54+00:00
 2  2019-06-12  19:59:52+00:00      65 2019-06-12 19:59:52+00:00
 3  2019-06-12  19:59:34+00:00     140 2019-06-12 19:59:34+00:00]

In [32]: df
Out[32]:
         date            time  random                 date_time
0  2019-06-12  19:59:59+00:00      99 2019-06-12 19:59:59+00:00
1  2019-06-12  19:59:54+00:00     200 2019-06-12 19:59:54+00:00
2  2019-06-12  19:59:52+00:00      65 2019-06-12 19:59:52+00:00
3  2019-06-12  19:59:34+00:00     140 2019-06-12 19:59:34+00:00

It works as expected. That is, `df_list` has its own pointer, but `df_list[0]` and `df` share the same pointer, so `df` changes when `df_list[0]` changes.

In [35]: hex(id(df))
Out[35]: '0x7f2c90e8d978'

In [36]: hex(id(df_list[0]))
Out[36]: '0x7f2c90e8d978'

In [37]: hex(id(df_list))
Out[37]: '0x7f2c90d68188'

For the method used to check the memory address of a Python variable, see the answer to "print memory address of Python variable [duplicate]".

But in the following example, we are facing a different scenario.

In [22]: df_list = [df]
In [23]: df_list[0] = df_list[0].iloc[::-1]

In [24]: df_list
Out[24]:
[         date            time  random
 3  2019-06-12  19:59:34+00:00     140
 2  2019-06-12  19:59:52+00:00      65
 1  2019-06-12  19:59:54+00:00     200
 0  2019-06-12  19:59:59+00:00      99]

In [25]: df
Out[25]:
         date            time  random
0  2019-06-12  19:59:59+00:00      99
1  2019-06-12  19:59:54+00:00     200
2  2019-06-12  19:59:52+00:00      65
3  2019-06-12  19:59:34+00:00     140

In [26]: df_list[0]['date_time'] = pd.to_datetime(df_list[0]['date']+' '+df_list[0]['time'], utc=True)
/usr/bin/ipython3:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  #! /bin/sh
In [27]: df_list
Out[27]:
[         date            time  random                 date_time
 3  2019-06-12  19:59:34+00:00     140 2019-06-12 19:59:34+00:00
 2  2019-06-12  19:59:52+00:00      65 2019-06-12 19:59:52+00:00
 1  2019-06-12  19:59:54+00:00     200 2019-06-12 19:59:54+00:00
 0  2019-06-12  19:59:59+00:00      99 2019-06-12 19:59:59+00:00]

In [28]: df
Out[28]:
         date            time  random
0  2019-06-12  19:59:59+00:00      99
1  2019-06-12  19:59:54+00:00     200
2  2019-06-12  19:59:52+00:00      65
3  2019-06-12  19:59:34+00:00     140

The reason is that assigning to `df_list[0]` replaces the list item: the old `df_list[0]` is removed from the list and a new object takes its place. Neither step touches the original mutable object, so `df` is unchanged, and the column added afterwards lands only on the new object.

In [40]: hex(id(df_list[0]))
Out[40]: '0x7f2c90d6ea58'

In [41]: hex(id(df))
Out[41]: '0x7f2c90e8d978'

As we can see, the pointer of `df_list[0]` has changed.
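The same check can be written with the `is` operator instead of comparing `hex(id(...))` strings. This is a minimal sketch; the small dataframe and the `doubled` column are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"random": [99, 200, 65, 140]})
df_list = [df]

# The list element and `df` start out as the same object.
same_before = df_list[0] is df

# Adding a column mutates that object; identity is unchanged.
df_list[0]["doubled"] = df_list[0]["random"] * 2
same_after_mutation = df_list[0] is df

# Item assignment on the list stores a different object in slot 0.
df_list[0] = df_list[0].iloc[::-1]
same_after_replacement = df_list[0] is df

print(same_before, same_after_mutation, same_after_replacement)
# True True False
```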

Let's see the following simple illustration:

In [44]: a = [[1, 2, 3], [4, 5]]

In [45]: b = a[:]

In [46]: a[0] = [0, 0, 0]

In [47]: b
Out[47]: [[1, 2, 3], [4, 5]]

In [48]: a
Out[48]: [[0, 0, 0], [4, 5]]

The behavior may not be caused by the for loop as you suspected, but by the difference between assignment and shallow copy. HTH :)

Then, is there a way in Python to create a list of more complex objects (here, pandas dataframes) and change the underlying objects through operations on said list? — Coolio2654, Jan 29 '20 at 02:51
@Coolio2654 I thought the underlying object can be changed if the pointer remains the same. — Lerner Zhang, Jan 29 '20 at 04:34
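One way to keep the pointer the same while reversing the rows is to use only operations that mutate the dataframe in place. The sketch below is one possible approach, not the only one. It assumes each dataframe has a unique, ascending index (such as the default `RangeIndex`), so that sorting the index in descending order is equivalent to reversing the rows.

```python
import pandas as pd

df = pd.DataFrame({"random": [99, 200, 65, 140]})
df_list = [df]

for dataframe in df_list:
    # Both calls modify the existing object and return None,
    # so the name `dataframe` is never rebound.
    dataframe.sort_index(ascending=False, inplace=True)
    dataframe.reset_index(drop=True, inplace=True)

print(df["random"].tolist())  # [140, 65, 200, 99]
print(df_list[0] is df)       # True
```

If the index is not unique and sorted, this shortcut does not apply; in that case, rebuilding the list (and accepting that `df` keeps pointing at the old object) is the simpler route.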
