What is the correct way of doing the same operation over multiple pandas dataframes?

Question

I am trying to:

check if values in a range exist in a dataframe
if not, add the value and interpolate.

Referring to this answer, I have checked that it works for a single dataframe. For example:

# Original dataframe

    code    ratio
...
5   5.0     1.649561
6   6.0     1.466403
7   11.0    1.696970
8   12.0    1.646259

# Code to add row + interpolate
for i in range(5, 13):
    if i not in df.values:
        df.loc[-1, 'code'] = i
        df = df.sort_values('code').reset_index(drop=True)
        df = df.interpolate()

# Result
    code    ratio
0   5.0     1.649561
1   6.0     1.466403
2   7.0     1.581686
3   8.0     1.639328
4   9.0     1.668149
5   10.0    1.682559
6   11.0    1.696970
7   12.0    1.646259

Having confirmed that it works on a single dataframe, I wanted to apply the same operation to several dataframes. So I tried the following code, iterating over a list of dataframes:

for df in [df1, df2, df3...]:
    for i in range(5, 13):
        if i not in df.values:
            df.loc[-1, 'code'] = i
            df = df.sort_values('code').reset_index(drop=True)
            df = df.interpolate()

Then even for the dataframe that worked before, it returns:

    code    ratio
5   5.0     1.649561
6   6.0     1.466403
7   11.0    1.696970
8   12.0    1.646259
-1  7.0     NaN

Which is clearly not the result I want.

What causes this difference? Is using a list of multiple dataframes for iteration a wrong approach to this?

I'd suggest it's better to use a `dict` in cases like this when modifying DataFrames iteratively - with keys `df1`, `df2`.... — Chris Adams, Jan 29 '20 at 08:33
Iterating using `for name, df in df_dict.items():`, and assigning back with `df_dict[name] = df` doesn't seem to work as well :( — wookiekim, Jan 29 '20 at 08:34
how about just unpacking back to original variable names.... so after the loop I suggested, final line would be `df1, df2, df3 = df_list` — Chris Adams, Jan 29 '20 at 08:35
check out [this question](https://stackoverflow.com/questions/14814771/do-python-for-loops-work-by-reference) and [this one](https://stackoverflow.com/questions/49986865/modifying-dataframes-inside-a-list-is-not-working) — Chris Adams, Jan 29 '20 at 08:47

Chris Adams · Accepted Answer · 2020-01-29T08:37:36.800

1

You need to assign each result back into the list and then unpack the list into the original variables, for example something like:

df_list = [df1, df2, df3...]
for i, df in enumerate(df_list):
    for j in range(5, 13):
        if j not in df.values:
            df.loc[-1, 'code'] = j
            df = df.sort_values('code').reset_index(drop=True)
            df = df.interpolate()
    df_list[i] = df

#Unpack back to original variables
df1, df2, df3, ... = df_list
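The reason the loop in the question leaves `df1` half-modified can be shown with a minimal sketch. The sample values are taken from the question; the variable `rebound_is_original` is only there for illustration. The first `.loc` assignment mutates the shared object, while `df = df.sort_values(...)` merely rebinds the loop variable to a new DataFrame:

```python
import pandas as pd

df1 = pd.DataFrame({"code": [5.0, 6.0, 11.0, 12.0],
                    "ratio": [1.649561, 1.466403, 1.696970, 1.646259]})
df_list = [df1]

for df in df_list:
    # In-place: mutates the object that df1 and df_list[0] also refer to.
    df.loc[-1, "code"] = 7
    # Rebinding: sort_values returns a NEW DataFrame; only the local name
    # `df` points at it now. df1 and df_list[0] still name the old object.
    df = df.sort_values("code").reset_index(drop=True)
    rebound_is_original = df is df1

print(rebound_is_original)  # False
print(df_list[0] is df1)    # True
print(len(df1))             # 5: the original only received the first appended row
```

This is why the result has to be stored explicitly (`df_list[i] = df`) and why `df1` itself only changes after unpacking.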

edited Jan 29 '20 at 08:37

answered Jan 29 '20 at 08:16


I'd recommend using a `dict` as a container for your DataFrames with keys `df1`, `df2`... rather than `list` here though – Chris Adams Jan 29 '20 at 08:19
Sorry, it doesn't seem to be working. Say `df1` is the dataframe I primarily checked with, and after using your code and checking again with `df1`, I still get the unwanted result in my question. Even after changing `df.loc[-1, 'code'] = i` to `df.loc[-1, 'code'] = j` – wookiekim Jan 29 '20 at 08:21
yeah it's not going to change that variable if you use a list, but it will change `df_list[0]` - this is why it'd be better to use a dict `df_dict = {'df1': ...}`, then you can iterate and update the values and return it with `df_dict['df1']` – Chris Adams Jan 29 '20 at 08:24
Okay, that would call for some changes to the code overall... But will try. I won't be able to accept your answer as-is even if it works though. – wookiekim Jan 29 '20 at 08:28
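The dict-based variant suggested in these comments could look roughly like the sketch below. The sample frames reuse the values from this thread, and checking `df["code"].values` instead of `df.values` is an assumption made here so that only the `code` column is searched. Note that this updates `df_dict["df1"]`, not a separate variable named `df1`:

```python
import pandas as pd

# Sample frames keyed by name (values borrowed from the thread).
df_dict = {
    "df1": pd.DataFrame({"code": [5.0, 6.0, 11.0, 12.0],
                         "ratio": [1.649561, 1.466403, 1.696970, 1.646259]}),
    "df2": pd.DataFrame({"code": [5.0, 6.0, 11.0, 19.0],
                         "ratio": [1.649561, 1.466403, 1.696970, 1.646259]}),
}

for name, df in df_dict.items():
    for j in range(5, 13):
        if j not in df["code"].values:
            df.loc[-1, "code"] = j                              # append the missing code
            df = df.sort_values("code").reset_index(drop=True)  # new object each time
            df = df.interpolate()                               # fill the NaN ratio
    # Store the final object back under the same key; reassigning an
    # existing key while iterating over items() is allowed.
    df_dict[name] = df

print(df_dict["df1"])
```

Afterwards the processed frames are read back with `df_dict["df1"]`, `df_dict["df2"]`, and so on.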

score 0 · Answer 2 · answered Jan 29 '20 at 09:04

You can use inplace=True to modify each dataframe in the list directly. The list holds references to the original dataframe objects rather than copies, so any in-place modification made through the list also affects the original dataframe. However, in-place calls return None and cannot be chained, so the chained command has to be split into individual method calls, each with inplace=True.
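As a quick check of that claim, the following minimal sketch (with a made-up two-row frame) shows that the list element and the original variable are the same object, so an in-place call through the list is visible through `df1`:

```python
import pandas as pd

df1 = pd.DataFrame({"code": [6.0, 5.0], "ratio": [1.466403, 1.649561]})
dfs = [df1]

print(dfs[0] is df1)  # True: the list stores a reference, not a copy

# An in-place call through the list element changes the object df1 names too.
dfs[0].sort_values("code", inplace=True)
print(df1["code"].tolist())  # [5.0, 6.0]
```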

Sample dataframes

In [153]: df1
Out[153]:
   code     ratio
0   5.0  1.649561
1   6.0  1.466403
2  11.0  1.696970
3  12.0  1.646259

In [155]: df2
Out[155]:
   code     ratio
0   5.0  1.649561
1   6.0  1.466403
2  11.0  1.696970
3  19.0  1.646259

dfs = [df1, df2]

for df in dfs:
    for i in range(5, 13):
        if i not in df.values:
            df.loc[-1, 'code'] = i
            df.sort_values('code', inplace=True)
            df.reset_index(drop=True, inplace=True)
            df.interpolate(inplace=True)

Output:

In [168]: df1
Out[168]:
   code     ratio
0   5.0  1.649561
1   6.0  1.466403
2   7.0  1.581686
3   8.0  1.639328
4   9.0  1.668149
5  10.0  1.682560
6  11.0  1.696970
7  12.0  1.646259

In [169]: df2
Out[169]:
   code     ratio
0   5.0  1.649561
1   6.0  1.466403
2   7.0  1.581686
3   8.0  1.639328
4   9.0  1.668149
5  10.0  1.682560
6  11.0  1.696970
7  12.0  1.671615
8  19.0  1.646259

Note: this solution only demonstrates that the approach is doable for this specific question. For more complex problems it may not be feasible, because some methods do not support inplace and the pandas developers have discussed deprecating the inplace option.
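One way to avoid inplace altogether is to wrap the steps in a function that returns a new dataframe and then rebuild the list. This is only a sketch: the helper name `add_missing_codes` and its `start`/`stop` parameters are hypothetical, and it works on a copy so the inputs are left untouched:

```python
import pandas as pd

def add_missing_codes(df, start=5, stop=13):
    """Return a new frame with every code in range(start, stop) present."""
    out = df.copy()  # never mutate the caller's frame
    for code in range(start, stop):
        if code not in out["code"].values:
            out.loc[-1, "code"] = code
            out = out.sort_values("code").reset_index(drop=True)
            out = out.interpolate()
    return out

df1 = pd.DataFrame({"code": [5.0, 6.0, 11.0, 12.0],
                    "ratio": [1.649561, 1.466403, 1.696970, 1.646259]})
df2 = pd.DataFrame({"code": [5.0, 6.0, 11.0, 19.0],
                    "ratio": [1.649561, 1.466403, 1.696970, 1.646259]})

# Build the processed frames, then unpack into new names.
new_df1, new_df2 = [add_missing_codes(df) for df in (df1, df2)]

print(new_df1)
print(new_df2)
```

Because the function returns its result instead of relying on side effects, it does not matter whether the frames live in a list, a dict, or separate variables.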
