Iterating over different data frames using an iterator

Question

Suppose I have n data frames df_1, df_2, df_3, ..., df_n, containing columns named SPEED1, SPEED2, SPEED3, ..., SPEEDn respectively, for instance:

import numpy as np
import pandas as pd
df_1 = pd.DataFrame({'SPEED1':np.random.uniform(0,600,100)})
df_2 = pd.DataFrame({'SPEED2':np.random.uniform(0,600,100)})

and I want to make the same changes to all of the data frames. How can I do this with a function along these lines?

def modify(df,nr):
    df_invalid_nr=df_nr[df_nr['SPEED'+str(nr)]>500]
    df_valid_nr=~df_invalid_nr
    Invalid_cycles_nr=df[df_invalid]
    df=df[df_valid]
    print(Invalid_cycles_nr)
    print(df)

So, when I try to run the above function

modify(df_1,1)

It returns the entire data frame without modification and the invalid cycles as an empty array. I am guessing I need to define the modification on the global dataframe somewhere in the function for this to work.

I am also not sure if I could do this another way, say just looping an iterator through all the data frames. But, I am not sure it will work.

for i in range(1,n+1):
    df_invalid_i=df_i[df_i['SPEED'+str(i)]>500]
    df_valid_i=~df_invalid_i
    Invalid_cycles_i=df[df_invalid]
    df=df[df_valid]
    print(Invalid_cycles_i)
    print(df)

How do I, in general, access df_1 using an iterator? It seems to be a problem.

Any help would be appreciated, thanks!

For your second point you can try the solution here: https://stackoverflow.com/a/17960039/3941704. Can you give a reproducible example of what you expect? — David Leon, Jan 31 '18 at 07:46
I am not sure what you want to do. I understand that you want to filter all your `df` to get the valid rows on one hand and the invalid rows on the other? Does each `df` contain all the speeds (i.e. `df1` has `speed1,..,speedn`) or does each `df` have one speed only (i.e. `df1` has `speed1`, `df2` has `speed2`, and so on)? — David Leon, Jan 31 '18 at 08:03
ok, then please have a look at my answer https://stackoverflow.com/a/48537995/3941704 — David Leon, Jan 31 '18 at 09:40

score 1 · Answer 1 · answered Jan 31 '18 at 08:30

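You can use the globals() function, which lets you look up a variable by its name.

Add df_i = globals()["df_"+str(i)] at the beginning of the for loop, and build the boolean mask before negating it:

for i in range(1,n+1):
    # Look up the module-level variable df_1, df_2, ... by name
    df_i = globals()["df_"+str(i)]
    # Boolean mask: True where the speed is above the limit
    invalid_i = df_i['SPEED'+str(i)] > 500
    Invalid_cycles_i = df_i[invalid_i]
    # ~ negates the mask, keeping only the valid rows
    df_i = df_i[~invalid_i]
    print(Invalid_cycles_i)
    print(df_i)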

David Leon · Accepted Answer · 2018-02-01T15:44:26.920

Solution

Inputs

import pandas as pd
import numpy as np 

df_1 = pd.DataFrame({'SPEED1':np.random.uniform(1,600,100)})
df_2 = pd.DataFrame({'SPEED2':np.random.uniform(1,600,100)})

Code

In my opinion, a better approach is to put your DataFrames in a list and enumerate over it, adding a valid column to each one:

for idx, df in enumerate([df_1, df_2]):
    col = 'SPEED'+str(idx+1)
    df['valid'] = df[col] <= 500

print(df_1)

       SPEED1  valid
0  516.395756  False
1   14.643694   True
2  478.085372   True
3  592.831029  False
4    1.431332   True

You can then filter for valid or invalid rows with df_1[df_1.valid] or df_1[~df_1.valid].
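
As a minimal sketch of that filtering step (the small hand-written frame and the names mask, valid_rows and invalid_rows are illustrative, not taken from the question), a boolean mask and its negation split a frame in two:

```python
import pandas as pd

# Hypothetical stand-in for df_1, with fixed values so the result is predictable
df_1 = pd.DataFrame({'SPEED1': [516.4, 14.6, 478.1, 592.8, 1.4]})
df_1['valid'] = df_1['SPEED1'] <= 500

mask = df_1['valid']          # boolean Series, one entry per row
valid_rows = df_1[mask]       # rows where SPEED1 <= 500
invalid_rows = df_1[~mask]    # ~ negates the mask element-wise

print(valid_rows)
print(invalid_rows)
```

Note that ~ is applied to the mask itself. In the question it was applied to the already-filtered frame, which is not a boolean mask and cannot be used to select rows.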

This solution fits the problem as you stated it. See "Another (better?) solution" for a cleaner approach, and the Notes below for the explanations you need.

Another (better?) solution

If possible, rethink your code. Each DataFrame has a single speed column, so name it SPEED in all of them:

dfs = dict(df_1=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}),
           df_2=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}))

This allows the following one-liner:

dfs = dict(map(lambda key_val: (key_val[0],
                                key_val[1].assign(valid = key_val[1]['SPEED'] <= 500)),
               dfs.items()))

print(dfs['df_1'])

        SPEED  valid
0  516.395756  False
1   14.643694   True
2  478.085372   True
3  592.831029  False
4    1.431332   True

Explanations:

dfs.items() returns the key/value pairs of the dict (i.e. the names and the DataFrames).
map(foo, bar) applies the function foo (here a lambda that calls DataFrame.assign) to every element of bar (i.e. to every key/value pair of dfs.items()).
dict() turns the result of map back into a dict.
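
The same transformation can also be written as a dict comprehension, which some readers find easier to follow than map with a lambda. This is only an equivalent sketch under the same assumption of a single SPEED column per frame:

```python
import numpy as np
import pandas as pd

dfs = {
    'df_1': pd.DataFrame({'SPEED': np.random.uniform(0, 600, 100)}),
    'df_2': pd.DataFrame({'SPEED': np.random.uniform(0, 600, 100)}),
}

# One new frame per key, each with an added boolean 'valid' column
dfs = {name: df.assign(valid=df['SPEED'] <= 500) for name, df in dfs.items()}

print(dfs['df_1'].head())
```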

Notes

About `modify`

Notice that your function modify does not return anything. I suggest reading more about mutability and immutability in Python. This article is interesting.

You can then test the following for instance:

def modify(df):
    # Rebinding df only changes the local name inside the function;
    # the caller's DataFrame is untouched, so return the result...
    df = df[df.SPEED1 < 0.5]
    return df

# ... and assign the output to apply the change
df_1 = modify(df_1)
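
Applied to the question, one possible shape is a function that returns both the valid and the invalid rows. The name split_by_speed, its parameters and the sample values are hypothetical; the 500 limit is the one used in the question:

```python
import pandas as pd

def split_by_speed(df, col, limit=500):
    """Return (valid, invalid) frames; the input frame is left untouched."""
    mask = df[col] <= limit
    return df[mask], df[~mask]

# Hypothetical data with fixed values
df_1 = pd.DataFrame({'SPEED1': [516.4, 14.6, 478.1, 592.8, 1.4]})

valid_1, invalid_1 = split_by_speed(df_1, 'SPEED1')

print(len(df_1))    # the original frame still has all of its rows
print(valid_1)
print(invalid_1)
```

Nothing changes for the caller until the returned values are assigned, for example with df_1 = valid_1.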

About access `df_1` using an iterator

Notice that when you do:

for i in range(1,n+1):
    df_i something

Inside the loop, df_i refers to a single variable literally named df_i on every iteration (not df_1, df_2, etc.). To look up an object by its name, use globals()['df_'+str(i)] instead (assuming that df_1 to df_n are in globals()), as in this answer.

In my opinion this is not a clean approach. I don't know how you create your DataFrames, but if possible I suggest storing them in a dictionary instead of assigning each one to its own variable:

dfs = {}
dfs['df_1'] = ...

or, a bit more automatically if df_1 to df_n already exist (following the first part of vestland's answer):

dfs = dict((var, eval(var)) for
           var in dir() if
           isinstance(eval(var), pd.core.frame.DataFrame) and 'df_' in var)

Then it is easier to iterate over your DataFrames:

for i in range(1,n+1):
    dfs['df_'+str(i)]  # do something with this DataFrame
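
Putting the pieces together, a sketch of the whole loop could look like this. It assumes the frames can be created directly inside the dict; n = 3 and the invalid dict are made up for the example:

```python
import numpy as np
import pandas as pd

n = 3  # hypothetical number of frames

# Build the frames straight into a dict instead of creating df_1, df_2, ... by hand
dfs = {
    'df_' + str(i): pd.DataFrame({'SPEED' + str(i): np.random.uniform(0, 600, 100)})
    for i in range(1, n + 1)
}

invalid = {}
for i in range(1, n + 1):
    name, col = 'df_' + str(i), 'SPEED' + str(i)
    mask = dfs[name][col] > 500
    invalid[name] = dfs[name][mask]    # rows above the limit
    dfs[name] = dfs[name][~mask]       # keep only the valid rows

print({name: len(df) for name, df in dfs.items()})
```

Because the frames live in a dict, the loop can replace each entry with its filtered version, which is not possible by rebinding a loop variable.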

score 1 · Answer 3 · answered Feb 01 '18 at 10:52

Your code sample leaves me a little confused, but focusing on

I want to make the same changes to all of the data frames.

and

How do I, in general, access df_1 using an iterator?

you can do exactly that by organizing your dataframes (dfs) in a dictionary (dict).

Here's how:

Assuming you've got a bunch of variables in your namespace...

# Imports
import pandas as pd
import numpy as np

# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['a', 'b']) 
df_1 = df_1.set_index(rng)

# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['c', 'd']) 
df_2 = df_2.set_index(rng)

# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['e', 'f']) 
df_3 = df_3.set_index(rng)

...you can identify all that are dataframes using:

alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]

If you've got a lot of different dataframes but would only like to focus on those that have a prefix like 'df_', you can identify those by...

dfNames = []
for elem in alldfs:
   if str(elem)[:3] == 'df_':
       dfNames.append(elem)

... and then organize them in a dict using:

myFrames = {}
for dfName in dfNames:
    myFrames[dfName] = eval(dfName)

From that dict of interesting dataframes, you can keep only those that you'd like to do something with. Here's how you focus only on df_1 and df_2 by dropping df_3:

invalid = ['df_3']
for inv in invalid:
    myFrames.pop(inv, None)

Now you can reference ALL your valid dfs by looping through them:

for key in myFrames.keys():
    print(myFrames[key])

And that should cover the...

How do I, in general, access df_1 using an iterator?

...part of the question.

And you can of course reference a single dataframe by its name / key in the dict:

print(myFrames['df_1'])

From here you can do something with ALL columns in ALL dataframes.

for key in myFrames.keys():
    myFrames[key] = myFrames[key]*10
    print(myFrames[key])

Or, being a bit more pythonic, you can specify a lambda function and apply that to a subset of columns

# A function
decimator = lambda x: x/10

# A subset of columns:
myCols = ['SPEED1', 'SPEED2']

Apply that function to your subset of columns in your dataframes of interest:

for key in myFrames.keys():
    for col in list(myFrames[key]):
        if col in myCols:
            myFrames[key][col] = myFrames[key][col].apply(decimator)
            print(myFrames[key][col])
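
For a simple arithmetic function like this one, the selected columns can also be divided directly, without apply. The frames below are hypothetical stand-ins that actually contain the SPEED columns:

```python
import pandas as pd

# Hypothetical frames; only some columns are in the list of interest
myFrames = {
    'df_1': pd.DataFrame({'SPEED1': [100.0, 120.0], 'other': [1, 2]}),
    'df_2': pd.DataFrame({'SPEED2': [130.0, 140.0]}),
}
myCols = ['SPEED1', 'SPEED2']

for key, frame in myFrames.items():
    present = [col for col in myCols if col in frame.columns]
    if present:
        # Dividing the selected columns gives the same values as applying x/10 to each element
        frame[present] = frame[present] / 10

print(myFrames['df_1'])
```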

So, back to your function...

modify(df_1,1)

... here's my take on it wrapped in a function.

First we'll redefine the dataframes and the function. Oh, and with this setup, you're going to have to obtain all dfs OUTSIDE your function with alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]. Here's the datasets and the function for an easy copy-paste:

# Imports
import pandas as pd
import numpy as np

# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3']) 
df_1 = df_1.set_index(rng)

# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3']) 
df_2 = df_2.set_index(rng)

# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3']) 
df_3 = df_3.set_index(rng)

# A function that divides columns by 10
decimator = lambda x: x/10

# A reference to all available dataframes
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]

# A function as per your request
def modify(dfs, cols, fx):

    """ Define a subset of available dataframes and list of interesting columns, and
        apply a function on those columns.
    """


    # Subset all dataframes with names that start with df_
    dfNames = []
    for elem in alldfs:
       if str(elem)[:3] == 'df_':
           dfNames.append(elem)

    # Organize those dfs in a dict if they match the dataframe names of interest
    myFrames = {}
    for dfName in dfNames:
        if dfName in dfs:    
            myFrames[dfName] = eval(dfName)
            print(myFrames)

    # Apply fx to the cols of your dfs subset
    for key in myFrames.keys():
        for col in list(myFrames[key]):
            if col in cols:
                myFrames[key][col] = myFrames[key][col].apply(fx)

# A testrun. Results in screenshots below
modify(dfs = ['df_1', 'df_2'], cols = ['SPEED1', 'SPEED2'], fx = decimator)

Here are dataframes df_1 and df_2 before manipulation:

Here are the dataframes after manipulation:

Anyway, this is how I would approach it.
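
If the frames are already held in a dict, a variant that takes the frames themselves rather than their names avoids eval and the module-level alldfs list. This is only a sketch; modify_frames and the sample data are hypothetical:

```python
import pandas as pd

def modify_frames(frames, cols, fx):
    """Return a new dict in which fx has been applied to the listed columns."""
    result = {}
    for name, frame in frames.items():
        out = frame.copy()              # leave the caller's frame untouched
        for col in cols:
            if col in out.columns:
                out[col] = out[col].apply(fx)
        result[name] = out
    return result

frames = {
    'df_1': pd.DataFrame({'SPEED1': [100, 110], 'SPEED3': [120, 130]}),
    'df_2': pd.DataFrame({'SPEED2': [140, 150]}),
}
changed = modify_frames(frames, cols=['SPEED1', 'SPEED2'], fx=lambda x: x / 10)
print(changed['df_1'])
```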

Hope you'll find it useful!
