
I am just starting to write my own functions, so forgive me if this is a simple question.

I have a few dataframes that all have a column named 'interval_time' (for example). I would like to rename this column to 'Timestamp' and then make the renamed column the index.

I know that I can do this manually with:

df = df.rename(index=str, columns={'interval_time': 'Timestamp'})
df = df.set_index('Timestamp')

Now I would like to define a function that does this for me. I have seen that this works for the rename step:

def rename_col(data, col_in='tempus_interval_time', col_out='Timestamp'):
    return data.rename(index=str, columns={col_in: col_out}, inplace=True)

But when I add the second step to the same function, it does not seem to do anything. If I define the second step as its own function and run it, it does seem to work.

This is what I am trying:

def rename_n_index(data, col_in='tempus_interval_time', col_out='Timestamp'):
    return data.rename(index=str, columns={col_in: col_out}, inplace=True)
    return data.set_index('Timestamp', inplace=True)

The dataframes that I am using have the following form:

df_scada
         interval_time    A  ...          X          Y
0  2010-11-01 00:00:00  0.0  ...  396.36710  381.68860
1  2010-11-01 00:05:00  0.0  ...  392.97974  381.40634
2  2010-11-01 00:10:00  0.0  ...  390.15695  379.99493
3  2010-11-01 00:15:00  0.0  ...  389.02786  379.14810
roganjosh
Luka Vlaskalic
  • Have you tried chaining them together? `return df.rename(...).set_index(...)` – dashiell Jul 06 '18 at 14:44
  • When a return statement gets evaluated in python, it quits out of the function call. Any further return statements are ignored. A common way to return multiple objects at once is to return a tuple containing the objects. However as the answer by Martijn points out, you don't have to return anything if you are modifying objects in place. – tobsecret Jul 06 '18 at 14:48
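The two points made in the comments can be checked with a small sketch. The sample frame and the function name `two_returns` are made up for illustration, and the default `col_in` here is `'interval_time'` rather than the question's `'tempus_interval_time'`. It shows that the function exits at the first `return`, so the `set_index` line is never reached.

```python
import pandas as pd


def make_frame():
    # Hypothetical sample data, loosely modelled on the question's df_scada.
    return pd.DataFrame({
        'interval_time': pd.to_datetime(['2010-11-01 00:00:00',
                                         '2010-11-01 00:05:00']),
        'A': [0.0, 0.0],
    })


def two_returns(data, col_in='interval_time', col_out='Timestamp'):
    # Execution leaves the function at the first return ...
    return data.rename(columns={col_in: col_out}, inplace=True)
    # ... so this line is unreachable and the index is never set.
    return data.set_index(col_out, inplace=True)


df = make_frame()
result = two_returns(df)

print(result)            # expected to be None: with inplace=True no frame is handed back
print(list(df.columns))  # the rename did happen, in place
print(df.index.name)     # still unnamed, because set_index never ran
```

This matches what the question describes: the rename appears to work, while the second step silently does nothing.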

2 Answers


You don't need to return anything, because your operations are done in place. You can do the in-place changes in your function:

def rename_n_index(data, col_in='tempus_interval_time', col_out='Timestamp'):
    data.rename(index=str, columns={col_in: col_out}, inplace=True)
    data.set_index(col_out, inplace=True)

and any other references to the dataframe you pass into the function will see the changes made:

>>> import pandas as pd
>>> df = pd.DataFrame({'interval_time': pd.to_datetime(['2010-11-01 00:00:00', '2010-11-01 00:05:00', '2010-11-01 00:10:00', '2010-11-01 00:15:00']),
...     'A': [0.0] * 4}, index=range(4))
>>> df
     A       interval_time
0  0.0 2010-11-01 00:00:00
1  0.0 2010-11-01 00:05:00
2  0.0 2010-11-01 00:10:00
3  0.0 2010-11-01 00:15:00
>>> def rename_n_index(data, col_in='tempus_interval_time', col_out='Timestamp'):
...     data.rename(index=str, columns={col_in: col_out}, inplace=True)
...     data.set_index(col_out, inplace=True)
...
>>> rename_n_index(df, 'interval_time')
>>> df
                       A
Timestamp
2010-11-01 00:00:00  0.0
2010-11-01 00:05:00  0.0
2010-11-01 00:10:00  0.0
2010-11-01 00:15:00  0.0

In the above example, the df reference to the dataframe shows the changes made by the function.
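As a rough check of that claim, the sketch below binds a second name to the same dataframe before calling the in-place version. The name `alias`, the two-row sample data, and the `'interval_time'` default for `col_in` are assumptions made for illustration.

```python
import pandas as pd


def rename_n_index(data, col_in='interval_time', col_out='Timestamp'):
    # Both calls mutate the frame that was passed in; nothing is returned.
    data.rename(columns={col_in: col_out}, inplace=True)
    data.set_index(col_out, inplace=True)


df = pd.DataFrame({
    'interval_time': pd.to_datetime(['2010-11-01 00:00:00',
                                     '2010-11-01 00:05:00']),
    'A': [0.0, 0.0],
})
alias = df  # a second name for the same object, not a copy

rename_n_index(df)

print(alias is df)          # True: both names refer to one dataframe
print(alias.index.name)     # Timestamp
print(list(alias.columns))  # ['A']
```

Because `alias` and `df` are the same object, the change is visible through either name. An explicit `df.copy()` would be needed to keep an untouched version.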

If you remove the inplace=True arguments, the method calls return a new dataframe object. You can store an intermediate result as a local variable, then apply the second method to the dataframe referenced in that local variable:

def rename_n_index(data, col_in='tempus_interval_time', col_out='Timestamp'):
    renamed = data.rename(index=str, columns={col_in: col_out})
    return renamed.set_index(col_out)

or you can chain the method calls directly to the returned object:

def rename_n_index(data, col_in='tempus_interval_time', col_out='Timestamp'):
    return data.rename(index=str, columns={col_in: col_out})\
               .set_index(col_out)

Because renamed is already a new dataframe, you can apply the set_index() call in-place to that object, then return just renamed, as well:

def rename_n_index(data, col_in='tempus_interval_time', col_out='Timestamp'):
    renamed = data.rename(index=str, columns={col_in: col_out})
    renamed.set_index(col_out, inplace=True)
    return renamed

Either way, this returns a new dataframe object, leaving the original dataframe unchanged:

>>> def rename_n_index(data, col_in='tempus_interval_time', col_out='Timestamp'):
...     renamed = data.rename(index=str, columns={col_in: col_out})
...     return renamed.set_index(col_out)
...
>>> df = pd.DataFrame({'interval_time': pd.to_datetime(['2010-11-01 00:00:00', '2010-11-01 00:05:00', '2010-11-01 00:10:00', '2010-11-01 00:15:00']),
...     'A': [0.0] * 4}, index=range(4))
>>> rename_n_index(df, 'interval_time')
                       A
Timestamp
2010-11-01 00:00:00  0.0
2010-11-01 00:05:00  0.0
2010-11-01 00:10:00  0.0
2010-11-01 00:15:00  0.0
>>> df
     A       interval_time
0  0.0 2010-11-01 00:00:00
1  0.0 2010-11-01 00:05:00
2  0.0 2010-11-01 00:10:00
3  0.0 2010-11-01 00:15:00
Martijn Pieters
  • Method chaining is a 4th possibility, i.e. `renamed = data.rename(...)\ .set_index(...)`. Some find it aesthetically pleasing to see method calls visually aligned. – jpp Jul 06 '18 at 14:48
  • @jpp: yes, but so ugly I don't know if I want to go there ;-) – Martijn Pieters Jul 06 '18 at 14:49

See @MartijnPieters' explanation for resolving the errors in your code.

However, note that the Pandorable method is to use method chaining. Some find it aesthetically pleasing to see method names visually aligned. Here's an example:

def rename_n_index(data, col_in='tempus_interval_time', col_out='Timestamp'):

    renamed = data.rename(index=str, columns={col_in: col_out})\
                  .set_index(col_out)

    return renamed

Then to apply these to a sequence of dataframes as in your previous question:

dfs = [df.pipe(rename_n_index) for df in (df1, df2, df3)]
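A self-contained sketch of the same idea follows, with two assumptions that are not in the answer above. The chain is wrapped in parentheses instead of using a backslash, and the sample frames `df1` and `df2` are invented so that one of them needs a non-default `col_in`. `DataFrame.pipe` forwards extra keyword arguments to the function, which is how the second call overrides `col_in`.

```python
import pandas as pd


def rename_n_index(data, col_in='interval_time', col_out='Timestamp'):
    # Parentheses let the chain span several lines without backslashes.
    return (
        data
        .rename(columns={col_in: col_out})
        .set_index(col_out)
    )


# Hypothetical frames standing in for the real data.
times = pd.to_datetime(['2010-11-01 00:00:00', '2010-11-01 00:05:00'])
df1 = pd.DataFrame({'interval_time': times, 'A': [0.0, 0.0]})
df2 = pd.DataFrame({'tempus_interval_time': times,
                    'X': [396.36710, 392.97974]})

dfs = [
    df1.pipe(rename_n_index),
    # Keyword arguments given to pipe are passed on to rename_n_index.
    df2.pipe(rename_n_index, col_in='tempus_interval_time'),
]

for frame in dfs:
    print(frame.index.name, list(frame.columns))
```

Since this version returns new frames, `df1` and `df2` keep their original columns and the results live only in `dfs`.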
jpp