How can I define a function that returns a new DataFrame with cleaned data?

Question

I have several dataframes that all need to be reduced to the same time span. So that I don't have to repeat the code block over and over again, I would like to write a function.

Currently I do this without a function, using the following code:

timerange = (df_a['Date'].max() - pd.DateOffset(months=11))
df_a_12m = df_a.loc[df_a['Date'] >= timerange]

My approach:

def Time_range(Data_1, x,name, column, name):
   t = Data_1[column].max() - pd.DateOffset(months=x)
   'df'_ + name = Data_1.loc[Data_1[column] >= t]

Unfortunately, this does not work.

stelioslogothetis · Accepted Answer · 2022-07-28T11:36:38.923

There are a few mistakes in your approach. First, when you create a new variable you need to specify exactly what it will be called. It is not possible to "dynamically" name a variable the way you are trying to with 'df_' + name = something.

Second, variable scope dictates that any variable created in a function is only accessible inside that function, and ceases to exist once it finishes executing (unless you play special tricks with global variables). So, even if you did df_name = Data_1.loc[Data_1[column] >= t], once Time_range() finishes running, that variable will be deleted.
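A minimal sketch of the scope point, using a plain list in place of a DataFrame. The names `make_local`, `inner`, and `leaked` are hypothetical and chosen only for illustration: a name assigned inside a function does not exist once the call has finished.

```python
def make_local():
    # 'inner' exists only while make_local() is running
    inner = [1, 2, 3]

make_local()

try:
    inner  # this name was never created outside the function
    leaked = True
except NameError:
    # Python raises NameError because 'inner' is unknown here
    leaked = False

print(leaked)
```
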

What you can do is have the function return the finished DataFrame and assign the result as a new variable from the outside:

def Time_range(Data_1, x, column):
    t = Data_1[column].max() - pd.DateOffset(months=x)
    return Data_1.loc[Data_1[column] >= t].copy()

df_any_name_you_want = Time_range(df_a, 11, 'Date')

Generally, this is what you want functions to do. Do some operations and return a finished value that you can then use from the outside.
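Since the question mentions several dataframes and a wish to build names such as 'df_' + name, here is a hedged sketch of one common substitute: keep the results in a dictionary keyed by name. The sample data, the lowercase function name `time_range`, and the dictionaries `frames` and `reduced` are assumptions made for illustration; only the filtering logic comes from the answer above.

```python
import pandas as pd

def time_range(data, months, column):
    """Keep the rows dated within `months` months of the newest date in `column`."""
    cutoff = data[column].max() - pd.DateOffset(months=months)
    return data.loc[data[column] >= cutoff].copy()

# Hypothetical sample data: one row per month start
df_a = pd.DataFrame({"Date": pd.date_range("2021-01-01", periods=24, freq="MS"),
                     "value": range(24)})
df_b = pd.DataFrame({"Date": pd.date_range("2020-06-01", periods=18, freq="MS"),
                     "value": range(18)})

# Use string keys instead of dynamically named variables
frames = {"a": df_a, "b": df_b}
reduced = {"df_" + name + "_12m": time_range(df, 11, "Date")
           for name, df in frames.items()}

print(reduced["df_a_12m"]["Date"].min())
```

Each reduced DataFrame is then available as, for example, `reduced["df_a_12m"]`, and the original DataFrames keep all of their rows.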

Thanks! Sometimes I have to think simple. – Steve Jul 28 '22 at 11:26

Jacob · Answer 2 · 2022-07-28T11:57:46.093

My approach would be:

Store your dataframes in a list e.g. dfs=[df_a,df_b]

Build a function from your approach. Input: (df, DeltaT=1, colName='Date'), Output: modified DataFrame

def Time_range(df, DeltaT=1, colName='Date'):  # Default values for DeltaT and colName. Helpful if they are constant in most cases.
    t = df[colName].max() - pd.DateOffset(months=DeltaT)
    df = df.loc[df[colName] >= t].copy()  # Good advice: use copy() to ensure that you do not work on your original data by mistake. Especially with the inplace=True argument, you increase the risk of unexpected behaviour.
    return df  # Important: you have to return the result of your function

Call your function with your list

results = []  # list for the modified dfs
for df in dfs:
    results.append(Time_range(df, DeltaT=2))

Important: the code was not tested and might contain typos.

Edit: Formatting

Edit 2: Following the discussion in the comments about the copy() command, here is a small example with proper formatting:

import pandas as pd

def EmptyDataFrameInplace(df):
    df.drop('A', axis=1, inplace=True)

def EmptyDataFrame(df):
    df=df.drop('A', axis=1)

dfA=pd.DataFrame({'A':[1,2,3], 'B':[4,5,6]})
dfB=dfA.copy()
print(dfA.head())

EmptyDataFrameInplace(dfA)
EmptyDataFrame(dfB)

print(dfA.head())
print(dfB.head())

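The result looks like this: after the two calls, dfA has lost column A, because it was dropped in place, while dfB still has both columns.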

Also see here. Thus, I always try to use copy() to ensure that I don't modify a DataFrame without noticing.

edited Jul 28 '22 at 11:57

answered Jul 28 '22 at 11:30

Good idea to add the call to `copy()`. However, your comment that not including it causes the dataframe to be modified is incorrect. DataFrames, as mutable objects, are [passed by assignment](https://stackoverflow.com/a/986145/7662085). When you do `df = value` inside the function, all you do is overwrite the *local variable* `df` *inside that function*. The DataFrame, in memory, is not affected. `df` just stops pointing to it and points to something new. Even if the DataFrame is called `df` *outside* the function. – stelioslogothetis Jul 28 '22 at 11:34
The advantage of calling `copy()`, however, is that `.loc` returns a *selection* from the original DataFrame. This is not a *separate* DataFrame, just a window to the old one which is re-computed when you access it. Complicated operations on this selection become difficult, and Pandas will complain at you. – stelioslogothetis Jul 28 '22 at 11:36
@stelioslogothetis This was my thought, too, however pandas does not work like that. Just run the following snippet as example: import pandas as pd def EmptyDataFrameInplace(df): df.drop('A', axis=1, inplace=True) def EmptyDataFrame(df): df.drop('A', axis=1, inplace=True) dfA=pd.DataFrame({'A':[1,2,3], 'B':[4,5,6]}) dfB=dfA.copy() print(dfA.head()) EmptyDataFrameInplace(dfA) EmptyDataFrame(dfB) print(dfA.head()) print(dfB.head()) Both print statements are missing the column A *I guess formatting of code is not possible in comments* – Jacob Jul 28 '22 at 11:43
In this case you are explicitly calling `drop` with `inplace=True`. If you omitted that, `drop` would by default return a *copy* of the DataFrame with `'A'` removed. Similarly, `.loc` will return a view into the DataFrame. However, it will *not* overwrite it, the full DataFrame will continue to exist. You do have a point that any operations on the returned `df.loc` will affect the original DataFrame if `copy` is not called. – stelioslogothetis Jul 28 '22 at 11:50
@stelioslogothetis You are absolutely right, without inplace the original object is not modified. In my example I had a copy-paste mistake. Thank you for the clarification. :) – Jacob Jul 28 '22 at 11:55
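A small check of the point the comment thread settles on, sketched with hypothetical data (the names `original` and `subset` are assumptions): when the filtered result is taken with `.copy()`, changing it afterwards leaves the original DataFrame untouched.

```python
import pandas as pd

original = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Filter with .loc and take an explicit copy, as both answers do
subset = original.loc[original["A"] >= 2].copy()

# Modify the copy only; `original` is a separate object
subset["B"] = 0

print(original)
print(subset)
```
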
