
Suppose I have n data frames df_1, df_2, df_3, ..., df_n, containing columns named SPEED1, SPEED2, SPEED3, ..., SPEEDn respectively, for instance:

import numpy as np
import pandas as pd
df_1 = pd.DataFrame({'SPEED1':np.random.uniform(0,600,100)})
df_2 = pd.DataFrame({'SPEED2':np.random.uniform(0,600,100)})

and I want to make the same changes to all of the data frames. How can I do that by defining a function along these lines?

def modify(df,nr):
    df_invalid_nr=df_nr[df_nr['SPEED'+str(nr)]>500]
    df_valid_nr=~df_invalid_nr
    Invalid_cycles_nr=df[df_invalid]
    df=df[df_valid]
    print(Invalid_cycles_nr)
    print(df)

So, when I try to run the above function

modify(df_1,1)

It returns the entire data frame without modification and the invalid cycles as an empty array. I am guessing I need to define the modification on the global dataframe somewhere in the function for this to work.

I am also not sure if I could do this another way, say just looping an iterator through all the data frames. But, I am not sure it will work.

for i in range(1,n+1):
    df_invalid_i=df_i[df_i['SPEED'+str(i)]>500]
    df_valid_i=~df_invalid_i
    Invalid_cycles_i=df[df_invalid]
    df=df[df_valid]
    print(Invalid_cycles_i)
    print(df)

How do I, in general, access df_1 using an iterator? It seems to be a problem.

Any help would be appreciated, thanks!

David Leon
hegdep
  • For your second point you can try the solution here: https://stackoverflow.com/a/17960039/3941704. Is it possible for you to give a reproducible example of what you expect? – David Leon Jan 31 '18 at 07:46
  • I am not sure what you want to do. I understand that you want to filter all your `df` in order to get the valid rows on one hand and the invalid rows on the other? Does each `df` contain all the speeds (i.e. `df1` has `speed1,..,speedn`) or does each `df` have one speed only (i.e. `df1` has `speed1`, `df2` has `speed2`, and so on)? – David Leon Jan 31 '18 at 08:03
  • Each df has one speed only. – hegdep Jan 31 '18 at 09:32
  • ok, then please have a look at my answer https://stackoverflow.com/a/48537995/3941704 – David Leon Jan 31 '18 at 09:40

3 Answers

You can use the globals() function, which lets you get a variable by its name.

I just added df_i = globals()["df_"+str(i)] at the beginning of the for loop:

for i in range(1,n+1):
    df_i = globals()["df_"+str(i)]          # look up df_1, df_2, ... by name
    invalid_i = df_i['SPEED'+str(i)] > 500  # boolean mask of the invalid rows
    Invalid_cycles_i = df_i[invalid_i]
    df_i = df_i[~invalid_i]                 # filtered copy; df_1 etc. are not changed
    print(Invalid_cycles_i)
    print(df_i)
Thomas

Solution

Inputs

import pandas as pd
import numpy as np 

df_1 = pd.DataFrame({'SPEED1':np.random.uniform(1,600,100)})
df_2 = pd.DataFrame({'SPEED2':np.random.uniform(1,600,100)})

Code

To my mind, a better approach is to store your dfs in a list and enumerate over it, adding a boolean valid column to each df:

for idx, df in enumerate([df_1, df_2]):
    col = 'SPEED'+str(idx+1)
    df['valid'] = df[col] <= 500

print(df_1.head())

       SPEED1  valid
0  516.395756  False
1   14.643694   True
2  478.085372   True
3  592.831029  False
4    1.431332   True

You can then filter for valid or invalid rows with df_1[df_1.valid] or df_1[~df_1.valid].

This solution fits your problem as stated; see "Another (better?) solution" for a cleaner variant and "Notes" below for the explanations you need.
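
As a small sketch of that filtering step, assuming a frame that already carries the valid column. The values are hard-coded (not random) so the result is reproducible, and the names valid_cycles and invalid_cycles are only illustrative:

```python
import pandas as pd

# Hard-coded sample values instead of random data, so the result is reproducible
df_1 = pd.DataFrame({'SPEED1': [516.395756, 14.643694, 478.085372,
                                592.831029, 1.431332]})
df_1['valid'] = df_1['SPEED1'] <= 500

# Boolean indexing: keep the rows where the mask is True, or negate it with ~
valid_cycles = df_1[df_1['valid']]
invalid_cycles = df_1[~df_1['valid']]

print(valid_cycles)
print(invalid_cycles)
```

Both results are new frames; df_1 itself still holds all five rows.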


Another (better?) solution

If possible, rethink your code. Since each DataFrame has only one speed column, name it SPEED in all of them:

dfs = dict(df_1=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}),
           df_2=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}))

That allows the following one-liner:

dfs = dict(map(lambda key_val: (key_val[0],
                                key_val[1].assign(valid = key_val[1]['SPEED'] <= 500)),
               dfs.items()))

print(dfs['df_1'])

        SPEED  valid
0  516.395756  False
1   14.643694   True
2  478.085372   True
3  592.831029  False
4    1.431332   True

Explanations:

  • dfs.items() returns the key/value pairs of the dict (i.e. the names and the DataFrames)
  • map(foo, bar) applies the function foo (see this answer, and DataFrame assign) to all the elements of bar (i.e. to all the key/value pairs of dfs.items()).
  • dict() converts the result of map back into a dict.
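
The same transformation can probably be written more readably as a dict comprehension. This is a sketch with hypothetical hard-coded values, assuming every frame uses the column name SPEED:

```python
import pandas as pd

# Hypothetical sample data; every frame uses the same column name SPEED
dfs = {
    'df_1': pd.DataFrame({'SPEED': [516.4, 14.6, 478.1]}),
    'df_2': pd.DataFrame({'SPEED': [592.8, 1.4]}),
}

# assign() returns a new frame with the extra column, so the dict is rebuilt
dfs = {name: df.assign(valid=df['SPEED'] <= 500) for name, df in dfs.items()}

print(dfs['df_1'])
```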

Notes

About modify

Notice that your function modify does not return anything. I suggest you read more about mutability and immutability in Python. This article is interesting.

You can then test the following for instance:

def modify(df):
    df=df[df.SPEED1<0.5]
    # This rebinds df inside the function only;
    # it does not modify your input, so return the df...
    return df

# ... and assign the output back to apply the change
df_1 = modify(df_1)
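
Closer to the modify(df, nr) from the question, a minimal sketch could return both the kept and the rejected rows. The limit parameter and the sample values are assumptions made for illustration:

```python
import pandas as pd

def modify(df, nr, limit=500):
    """Split df into (valid, invalid) rows based on the column SPEED<nr>."""
    col = 'SPEED' + str(nr)
    invalid = df[col] > limit          # boolean mask, one entry per row
    return df[~invalid], df[invalid]   # new frames; the input is untouched

# Hypothetical sample data
df_1 = pd.DataFrame({'SPEED1': [516.4, 14.6, 478.1, 592.8, 1.4]})

# Rebind df_1 to the filtered result and keep the rejected rows separately
df_1, invalid_cycles_1 = modify(df_1, 1)
print(invalid_cycles_1)
print(df_1)
```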

About access df_1 using an iterator

Notice that when you do:

for i in range(1,n+1):
    df_i something

df_i in your loop refers to a single object literally named df_i on every iteration (and not to df_1, df_2, etc.). To look an object up by its name, use globals()['df_'+str(i)] instead (assuming that df_1 to df_n are located in globals()) - from this answer.

To my mind this is not a clean approach. I don't know how you create your DataFrames, but if possible I suggest storing them in a dictionary instead of assigning each one to its own variable manually:

dfs = {}
dfs['df_1'] = ...

or, a bit more automatically, if df_1 to df_n already exist (following the first part of vestland's answer):

dfs = dict((var, eval(var)) for
           var in dir() if
           isinstance(eval(var), pd.core.frame.DataFrame) and 'df_' in var)
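
A possible variant without eval, sketched under the assumption that the frames live in a mapping such as globals(). The helper name collect_frames and its prefix parameter are hypothetical:

```python
import pandas as pd

def collect_frames(namespace, prefix='df_'):
    """Return {name: DataFrame} for each DataFrame whose name starts with prefix."""
    # list(...) takes a snapshot, so the mapping can safely be globals()
    return {name: obj for name, obj in list(namespace.items())
            if name.startswith(prefix) and isinstance(obj, pd.DataFrame)}

# Stand-in for globals(), so the example does not depend on the calling module
namespace = {
    'df_1': pd.DataFrame({'SPEED1': [1.0]}),
    'df_2': pd.DataFrame({'SPEED2': [2.0]}),
    'df_note': 'not a DataFrame',
    'other': pd.DataFrame({'x': [0]}),
}
dfs = collect_frames(namespace)
print(sorted(dfs))
```

At module level, dfs = collect_frames(globals()) would pick up df_1 to df_n in the same way.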

Then it would be easier for you to iterate over your DataFrames:

for i in range(1,n+1):
    dfs['df_'+str(i)]  # something
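
Filled in for the filtering in the question, that loop might look like the sketch below. The sample values and the invalid_cycles dict are assumptions for illustration:

```python
import pandas as pd

n = 2
# Hypothetical sample data: frame df_i holds the column SPEEDi
dfs = {'df_1': pd.DataFrame({'SPEED1': [516.4, 14.6, 478.1]}),
       'df_2': pd.DataFrame({'SPEED2': [592.8, 1.4, 501.0]})}

invalid_cycles = {}
for i in range(1, n + 1):
    key, col = 'df_' + str(i), 'SPEED' + str(i)
    invalid = dfs[key][col] > 500            # boolean mask of the invalid rows
    invalid_cycles[key] = dfs[key][invalid]  # keep the rejected rows
    dfs[key] = dfs[key][~invalid]            # store the filtered frame back

print(invalid_cycles['df_1'])
print(dfs['df_1'])
```

Because the filtered frame is stored back under the same key, the change is visible after the loop, which is what the loop in the question was missing.
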
David Leon

Your code sample leaves me a little confused, but focusing on

I want to make the same changes to all of the data frames.

and

How do I, in general, access df_1 using an iterator?

you can do exactly that by organizing your dataframes (dfs) in a dictionary (dict).

Here's how:


Assuming you've got a bunch of variables in your namespace...

# Imports
import pandas as pd
import numpy as np

# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['a', 'b']) 
df_1 = df_1.set_index(rng)

# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['c', 'd']) 
df_2 = df_2.set_index(rng)

# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['e', 'f']) 
df_3 = df_3.set_index(rng)

...you can identify all that are dataframes using:

alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]

If you've got a lot of different dataframes but would only like to focus on those that have a prefix like 'df_', you can identify those by...

dfNames = []
for elem in alldfs:
   if str(elem)[:3] == 'df_':
       dfNames.append(elem)

... and then organize them in a dict using:

myFrames = {}
for dfName in dfNames:
    myFrames[dfName] = eval(dfName)

From that dict of interesting dataframes, you can drop the ones you don't want to work with. Here's how you focus only on df_1 and df_2:

invalid = ['df_3']
for inv in invalid:
    myFrames.pop(inv, None)

Now you can reference ALL your valid dfs by looping through them:

for key in myFrames.keys():
    print(myFrames[key])

And that should cover the...

How do I, in general, access df_1 using an iterator?

...part of the question.

And you can of course reference a single dataframe by its name / key in the dict:

print(myFrames['df_1'])

From here you can do something with ALL columns in ALL dataframes.

for key in myFrames.keys():
    myFrames[key] = myFrames[key]*10
    print(myFrames[key])

Or, being a bit more pythonic, you can specify a lambda function and apply that to a subset of columns

# A function
decimator = lambda x: x/10

# A subset of columns (the example frames above use 'a' to 'f',
# so pick names that actually exist in your frames):
myCols = ['SPEED1', 'SPEED2']

Apply that function to your subset of columns in your dataframes of interest:

for key in myFrames.keys():
    for col in list(myFrames[key]):
        if col in myCols:
            myFrames[key][col] = myFrames[key][col].apply(decimator)
            print(myFrames[key][col])
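
For a simple arithmetic change like this one, the element-wise apply can likely be replaced by plain column arithmetic. This is a sketch with hypothetical frames that do contain the SPEED columns:

```python
import pandas as pd

# Hypothetical frames that contain some of the columns of interest
myFrames = {
    'df_1': pd.DataFrame({'SPEED1': [100, 120], 'SPEED3': [130, 140]}),
    'df_2': pd.DataFrame({'SPEED2': [110, 150], 'SPEED3': [105, 115]}),
}
myCols = ['SPEED1', 'SPEED2']

for key, frame in myFrames.items():
    # Only touch the columns of interest that this frame actually has
    present = [col for col in frame.columns if col in myCols]
    for col in present:
        frame[col] = frame[col] / 10   # vectorised, no per-element function call

print(myFrames['df_1'])
```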

So, back to your function...

modify(df_1,1)

... here's my take on it wrapped in a function.

First we'll redefine the dataframes and the function. Note that with this setup you have to collect the names of all dfs OUTSIDE your function with alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]. Here are the datasets and the function for an easy copy-paste:

# Imports
import pandas as pd
import numpy as np

# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3']) 
df_1 = df_1.set_index(rng)

# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3']) 
df_2 = df_2.set_index(rng)

# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3']) 
df_3 = df_3.set_index(rng)

# A function that divides columns by 10
decimator = lambda x: x/10

# A reference to all available dataframes
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]

# A function as per your request
def modify(dfs, cols, fx):

    """ Define a subset of available dataframes and list of interesting columns, and
        apply a function on those columns.
    """


    # Subset all dataframes with names that start with df_
    dfNames = []
    for elem in alldfs:
       if str(elem)[:3] == 'df_':
           dfNames.append(elem)

    # Organize those dfs in a dict if they match the dataframe names of interest
    myFrames = {}
    for dfName in dfNames:
        if dfName in dfs:    
            myFrames[dfName] = eval(dfName)
            print(myFrames)

    # Apply fx to the cols of your dfs subset
    for key in myFrames.keys():
        for col in list(myFrames[key]):
            if col in cols:
                myFrames[key][col] = myFrames[key][col].apply(fx)

# A testrun. Results in screenshots below
modify(dfs = ['df_1', 'df_2'], cols = ['SPEED1', 'SPEED2'], fx = decimator)
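
A variant that does not depend on the global alldfs list or on eval could take the frames as an explicit dict. This is only a sketch; the name modify_frames and the sample data are hypothetical:

```python
import pandas as pd

def modify_frames(frames, names, cols, fx):
    """Apply fx to each column in cols of the frames selected by names.

    frames is a dict {name: DataFrame}; the selected frames are changed in place.
    """
    for name in names:
        frame = frames[name]
        for col in frame.columns:
            if col in cols:
                frame[col] = frame[col].apply(fx)

decimator = lambda x: x / 10

# Hypothetical sample data
frames = {
    'df_1': pd.DataFrame({'SPEED1': [100, 120], 'SPEED3': [130, 140]}),
    'df_2': pd.DataFrame({'SPEED1': [110, 150], 'SPEED2': [105, 115]}),
    'df_3': pd.DataFrame({'SPEED1': [125, 135]}),
}

modify_frames(frames, names=['df_1', 'df_2'], cols=['SPEED1', 'SPEED2'], fx=decimator)
print(frames['df_1'])
```

Passing the dict in makes the function's inputs explicit, so it behaves the same regardless of what else is defined in the module.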

Here are dataframes df_1 and df_2 before manipulation:

[Screenshot: df_1 and df_2 before manipulation]

Here are the dataframes after manipulation:

[Screenshot: df_1 and df_2 after manipulation]

Anyway, this is how I would approach it.

Hope you'll find it useful!

vestland