How to store a new dataframe after using a self defined function on it?

Question

I am just starting to use user-defined functions, so this is probably not a very complex question, forgive me.

I have a few dataframes, which all have a column named 'interval_time' (for example) and I would like to rename this column 'Timestamp'.

I know that I can do this manually with:

df = df.rename(index=str, columns={'interval_time': 'Timestamp'})

but now I would like to define a function called rename that does this for me. I have seen that this works:

def rename(data):
    print(data.rename(index=str, columns={'interval_time': 'Timestamp'}))

but I can't seem to figure out how to save the renamed dataframe. I have tried this:

def rename(data):
    data = data.rename(index=str, columns={'interval_time': 'Timestamp'})

The dataframes that I am using have the following form:

df_scada
         interval_time    A  ...          X          Y
0  2010-11-01 00:00:00  0.0  ...  396.36710  381.68860
1  2010-11-01 00:05:00  0.0  ...  392.97974  381.40634
2  2010-11-01 00:10:00  0.0  ...  390.15695  379.99493
3  2010-11-01 00:15:00  0.0  ...  389.02786  379.14810

What about `return data.rename(...)` inside `rename` function and then `df = rename(df)`? — running.t, Jul 06 '18 at 10:15

score 3 · Accepted Answer · answered Jul 06 '18 at 10:21

There are a few points to note:

You need to use return in your function.
It's good practice to make such functions generic. For example, your input and output column names can be arguments with default values set.
Pandas offers pd.DataFrame.pipe to facilitate method chaining.
You should not name your function the same as the Pandas method. This will only lead to confusion.

Putting these elements together:

def rename_col(data, col_in='interval_time', col_out='Timestamp'):
    return data.rename(index=str, columns={col_in: col_out})

df = df.pipe(rename_col)

This is a trivial example, which probably doesn't require a user-defined function. However, the above advice may help when you write more complex functions.
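As a minimal sketch of how the function above could be applied to several dataframes at once: the sample data and the names df_other, frames and a_value below are made up for illustration, and only the interval_time column name comes from the question.

```python
import pandas as pd

def rename_col(data, col_in='interval_time', col_out='Timestamp'):
    # Returns a new DataFrame; the one passed in is left unchanged.
    return data.rename(index=str, columns={col_in: col_out})

# Hypothetical sample frames that share the 'interval_time' column.
df_scada = pd.DataFrame({
    'interval_time': ['2010-11-01 00:00:00', '2010-11-01 00:05:00'],
    'A': [0.0, 0.0],
})
df_other = pd.DataFrame({
    'interval_time': ['2010-11-01 00:10:00'],
    'A': [0.0],
})

# Keep the frames in a dict and store the renamed copies back under the same keys.
frames = {'scada': df_scada, 'other': df_other}
frames = {name: frame.pipe(rename_col) for name, frame in frames.items()}

# Extra arguments given to pipe are forwarded to the function.
custom = df_scada.pipe(rename_col, col_in='A', col_out='a_value')
```

Because rename_col returns a new object each time, the original df_scada and df_other keep their old column names unless you reassign them.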

answered Jul 06 '18 at 10:21

jpp

I agree that this is quite trivial, I could have done it more simply another way, I am just starting to understand how to use user-defined functions, so thought this was a good thing to try – Luka Vlaskalic Jul 06 '18 at 10:34
@LukaVlaskalic, No problem, I thought so, which is why I thought I'd give some extra pointers :) – jpp Jul 06 '18 at 10:34
I just updated the question with a further complexity – Luka Vlaskalic Jul 06 '18 at 12:21
I've rolled back. Please ask as a [new question](https://stackoverflow.com/questions/ask). Since there are already 3 answers, it's not practical for everyone to update their answers with the new requirement. – jpp Jul 06 '18 at 12:22
So if you really want to improve on pandas, check the brilliant [Modern Pandas](https://tomaugspurger.github.io/modern-1-intro.html) blog series. – Quickbeam2k1 Jul 06 '18 at 14:10
I didn't realise that everyone would need to change their answers, I just wanted a little bit of further help, and unfortunately, I can only post a question every 90 mins. But no worries, I now managed to post the question, thank you – Luka Vlaskalic Jul 06 '18 at 14:51
@LukaVlaskalic, Yep, unfortunately that's how SO works. Many people view all the answers (each may have a different valid solution), so having incomplete ones spoils the party. – jpp Jul 06 '18 at 14:52
I gathered as much, for sure makes sense – Luka Vlaskalic Jul 06 '18 at 14:55

jnd940 · Answer 2 · 2018-07-06T10:25:11.690

2

Without inplace=True, the function creates a new object, which needs to be returned:

import pandas as pd

def rename(data):
    return data.rename(index=str, columns={'interval_time': 'Timestamp'})

data = pd.DataFrame([1,2,3,4], columns=['interval_time'])
renamed_data = rename(data)

If no new DF should be created, set inplace=True in the function.
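A rough sketch of that in-place variant, for comparison. The name rename_inplace is hypothetical. One thing to watch for is that rename returns None when inplace=True is set, so the result of such a function should not be assigned back:

```python
import pandas as pd

def rename_inplace(data):
    # Modifies the DataFrame that was passed in; nothing is returned.
    data.rename(index=str, columns={'interval_time': 'Timestamp'}, inplace=True)

data = pd.DataFrame([1, 2, 3, 4], columns=['interval_time'])

# Call the function for its side effect only.
result = rename_inplace(data)

# 'result' is None here, so writing data = rename_inplace(data)
# would lose the reference to the DataFrame.
```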

edited Jul 06 '18 at 10:25

answered Jul 06 '18 at 10:19

jnd940

kosnik · Answer 3 · 2018-07-06T10:27:05.840

0

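You do not need to re-assign the DataFrame after you call the rename function, provided the function modifies it in place: pandas.DataFrame is a mutable object, and Python passes a reference to that object into the function, so in-place changes are visible to the caller. Rebinding the parameter name inside the function, as in your second attempt, has no effect outside it. Have a look at this link on how Python objects are passed: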

https://jeffknupp.com/blog/2012/11/13/is-python-callbyvalue-or-callbyreference-neither/

Also, you should use the inplace parameter so that the function modifies the existing object instead of returning a new one. Your code in the rename function will then look like:

def rename(data):
    data.rename(index=str, columns={'interval_time': 'Timestamp'}, inplace=True)

After you call rename(df) your DataFrame df has its columns renamed.
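To see the difference between rebinding the local name and modifying the object in place, here is a small sketch. The function names rename_rebind and rename_mutate and the sample data are made up for illustration:

```python
import pandas as pd

def rename_rebind(data):
    # Only the local name 'data' is rebound to the new DataFrame;
    # the new object is discarded when the function returns.
    data = data.rename(index=str, columns={'interval_time': 'Timestamp'})

def rename_mutate(data):
    # The object the caller passed in is changed directly.
    data.rename(index=str, columns={'interval_time': 'Timestamp'}, inplace=True)

df = pd.DataFrame({'interval_time': ['2010-11-01 00:00:00'], 'A': [0.0]})

rename_rebind(df)
cols_after_rebind = list(df.columns)   # the caller's df is unchanged

rename_mutate(df)
cols_after_mutate = list(df.columns)   # the caller's df now has the new name
```

The first function mirrors the second attempt in the question, which is why that attempt appeared to do nothing.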

edited Jul 06 '18 at 10:27

answered Jul 06 '18 at 10:19

kosnik

Actually, using inplace is very often [discouraged](https://stackoverflow.com/questions/45570984/pandas-is-inplace-true-considered-harmful-or-not). A better solution, by the way, would be to not create a new function at all and just use `data = data.rename(index=str, columns={'interval_time': 'Timestamp'})`. Anyway, this approach and your function are not suitable in pipelines – Quickbeam2k1 Jul 06 '18 at 12:31
