
What's the most fluent (or easy to read) method chaining solution for transforming columns in Pandas?

(“method chaining” or “fluent” is the coding style made popular by Tom Augspurger among others.)

For the sake of the example, let's set up some example data:

import pandas as pd
import seaborn as sns

df = sns.load_dataset("iris").astype(str)  # Just for this example
df.loc[1, :] = "NA"

df.head()
# 
#   sepal_length sepal_width petal_length petal_width species
# 0          5.1         3.5          1.4         0.2  setosa
# 1           NA          NA           NA          NA      NA
# 2          4.7         3.2          1.3         0.2  setosa
# 3          4.6         3.1          1.5         0.2  setosa
# 4          5.0         3.6          1.4         0.2  setosa

Just for this example: I want to map certain columns through a function - sepal_length using pd.to_numeric - while keeping the other columns as they were. What's the easiest way to do that in a method chaining style?

I can already use assign, but I'm repeating the column name here, which I don't want.

new_result = (
        df.assign(sepal_length = lambda df_: pd.to_numeric(df_.sepal_length, errors="coerce"))
          .head()  # Further chaining methods, whatever they may be
    )

I can use transform, but transform drops(!) the unmentioned columns. Transform with passthrough for the other columns would be ideal:

# Columns not mentioned in transform are lost
new_result = (
        df.transform({'sepal_length': lambda series: pd.to_numeric(series, errors="coerce")})
          .head()  # Further chaining methods...
    )
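To make the dropping behaviour concrete, here is a minimal sketch on a made-up two-column frame (the names `a`/`b` are just for the demo):

```python
import pandas as pd

# A made-up two-column frame, just to show the behaviour
df = pd.DataFrame({"a": ["1", "2"], "b": ["x", "y"]})

# transform with a dict silently keeps only the listed columns:
# "b" does not appear in the result at all
out = df.transform({"a": pd.to_numeric})
```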

Is there a “best” way to apply transformations to certain columns, in a fluent style, and pass the other columns along?


Edit: Below this line is a suggestion after reading Laurent's ideas.

Add a helper function that allows applying a mapping to just one column:

import functools

coerce_numeric = functools.partial(pd.to_numeric, errors='coerce')

def on_column(column, mapping):
    """
    Adaptor that takes a column transformation and returns a "whole dataframe" function suitable for .pipe()
    
    Note that the column takes the name of the returned series, if it has one.
    Columns mapped to None are removed from the result.
    """
    def on_column_(df):
        df = df.copy(deep=False)
        res = mapping(df[column])
        # drop column if mapped to None
        if res is None:
            df.pop(column)
            return df
        df[column] = res
        # update column name if mapper changes its name
        if hasattr(res, 'name') and res.name != column:
            df = df.rename(columns={column: res.name})
        return df
    return on_column_

This now allows the following neat chaining in the previous example:

new_result = (
        df.pipe(on_column('sepal_length', coerce_numeric))
          .head()  # Further chaining methods...
    )
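As a quick check of the drop and rename behaviours the docstring mentions, here is a self-contained sketch (the helper is repeated so the snippet runs on its own; the frame and names are made up):

```python
import pandas as pd

def on_column(column, mapping):
    """Adaptor: column transformation -> whole-dataframe function for .pipe()."""
    def on_column_(df):
        df = df.copy(deep=False)
        res = mapping(df[column])
        if res is None:          # mapped to None: drop the column
            df.pop(column)
            return df
        df[column] = res
        if hasattr(res, "name") and res.name != column:
            # follow the series' new name, if the mapper changed it
            df = df.rename(columns={column: res.name})
        return df
    return on_column_

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

dropped = df.pipe(on_column("b", lambda s: None))            # drops "b"
renamed = df.pipe(on_column("a", lambda s: s.rename("a2")))  # renames "a" to "a2"
```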

However, I'm still open to ways to do this in native pandas, without the glue code.


Edit 2 to further adapt Laurent's ideas, as an alternative. Self-contained example:

import pandas as pd

df = pd.DataFrame(
    {"col1": ["4", "1", "3", "2"], "col2": [9, 7, 6, 5], "col3": ["w", "z", "x", "y"]}
)

def map_columns(mapping=None, /, **kwargs):
    """
    Transform the specified columns and let the rest pass through.
    
    Examples:
    
        df.pipe(map_columns(a=lambda x: x + 1, b=str.upper))
        
        # dict for non-string column names
        df.pipe(map_columns({(0, 0): np.sqrt, (0, 1): np.log10}))
    """
    if mapping is not None and kwargs:
        raise ValueError("Only one of a dict and kwargs can be used at the same time")
    mapping = mapping or kwargs
    
    def map_columns_(df: pd.DataFrame) -> pd.DataFrame:
        mapping_funcs = {**{k: lambda x: x for k in df.columns}, **mapping}
        # preserve original order of columns
        return df.transform({key: mapping_funcs[key] for key in df.columns})
    return map_columns_
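One detail that makes chains like the one below work: plain element-level functions such as str.upper are also accepted in transform's dict form, since pandas ends up applying them to each element of the column. A minimal check, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"c": ["a", "b"]})

# str.upper operates on single strings, yet transform's dict form
# still works: pandas ends up applying it to each element
out = df.transform({"c": str.upper})
```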


df2 = (
    df
    .pipe(map_columns(col2=pd.to_numeric))
    .sort_values(by="col1")
    .pipe(map_columns(col1=lambda x: x.astype(str) + "0"))
    .pipe(map_columns({'col2': lambda x: -x, 'col3': str.upper}))
    .reset_index(drop=True)
)

df2

#   col1  col2 col3
# 0   10    -7    Z
# 1   20    -5    Y
# 2   30    -6    X
# 3   40    -9    W
creanion
  • does [this](https://stackoverflow.com/questions/36814100/pandas-to-numeric-for-multiple-columns) question and the several answers help ? – Rabinzel May 14 '22 at 10:46
  • 1
    Thank you, but it doesn't really address the fluent column transformation in general – creanion May 14 '22 at 16:54
  • I thought that's what you were asking for. apply a function (eg. pd.to_numeric) to multiple columns (specific ones, but not all). in the link I sent they did exactly that, either by make a list of columns to choose and then with `df[cols].apply` with axis=1 or for `col in cols` or they group columns together with same type. in all of the solution the other columns maintain. maybe I don't get the term "fluent column transformation", then I'm sorry :P – Rabinzel May 14 '22 at 17:06
  • 1
    I've added a link for method chaining. To find a solution in that style is the main focus of the question. I know how to solve the problem in general, and the solutions in your link are certainly helpful, but not what I'm after. – creanion May 14 '22 at 17:10

2 Answers


Here is my take on your interesting question.

I don't know of a more idiomatic way in Pandas to do method chaining than combining pipe, assign, or transform. But I understand that "transform with passthrough for the other columns would be ideal".

So, I suggest using it with a higher-order function to handle the other columns, leaning further into a functional style by taking advantage of the Python standard library's functools module.

For example, with the following toy dataframe:

df = pd.DataFrame(
    {"col1": ["4", "1", "3", "2"], "col2": [9, 7, 6, 5], "col3": ["w", "z", "x", "y"]}
)

You can define the following partial object:

from functools import partial
from typing import Any, Callable
import pandas as pd

def helper(df: pd.DataFrame, col: str, method: Callable[..., Any]) -> dict[str, Callable[..., Any]]:
    funcs = {col: method} | {k: lambda x: x for k in df.columns if k != col}
    # preserve original order of columns
    return {key: funcs[key] for key in df.columns}

on = partial(helper, df)

And then do all sorts of chain assignments using transform, for instance:

df = (
    df
    .transform(on("col1", pd.to_numeric))
    .sort_values(by="col1")
    .transform(on("col2", lambda x: x.astype(str) + "0"))
    .transform(on("col3", str.upper))
    .reset_index(drop=True)
)

print(df)
# Output
   col1 col2 col3
0     1   70    Z
1     2   50    Y
2     3   60    X
3     4   90    W
Laurent
  • Great answer! @Laurent I really liked the partial approach. – Shubham Sharma May 15 '22 at 17:01
  • Interesting ideas! Do you mind if I take this and try to change it into something I'd like to use? I'd like to see if I can avoid the partial (The dataframe should be as it is at the point in the chain, not frozen in the start) - and preferably keep the order of columns at all times - order in == order out is important to me. – creanion May 15 '22 at 17:11
  • 1
    Thanks @ShubhamSharma. @creanion I've edited my answer to tweak the helper function so that order of columns is preserved. You can do without `partial`, it was mainly to avoid repeating df, in the DRY spirit you seek. And since df is a mutable object, it is not really 'frozen' after the initial partial assignment. But it's up to you. Cheers. – Laurent May 15 '22 at 17:27
  • `df` won't reflect the result of the method chain, *at that point in the chain*, though, so it doesn't work well with all fluent code. Nothing personal, just looking at the end goal here. With inspiration from you, I've posted another helper function in the question. – creanion May 15 '22 at 17:32
  • Right. This is getting more and more interesting, cheers! – Laurent May 15 '22 at 17:43

If I understand the question correctly, perhaps using ** within assign will be helpful. For example, if you just wanted to convert the numeric data types using pd.to_numeric the following should work.

import numpy as np

df.assign(**df.select_dtypes(include=np.number).apply(pd.to_numeric, errors="coerce"))

By unpacking the df, you are essentially giving assign what it needs to assign each column. This would be equivalent to writing sepal_length=pd.to_numeric(df['sepal_length'], errors='coerce'), sepal_width=... for each column.
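A self-contained sketch of the same unpacking idea, on made-up data where the numeric-looking column is still a string dtype (explicit column selection stands in for select_dtypes here, since string columns wouldn't match np.number yet):

```python
import pandas as pd

df = pd.DataFrame({"x": ["1", "NA"], "y": ["a", "b"]})

# apply the conversion to a chosen subset, then splat the resulting
# DataFrame back into assign: unmentioned columns pass through untouched
result = df.assign(**df[["x"]].apply(pd.to_numeric, errors="coerce"))
```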

rhug123