35

I'm trying to replicate, roughly, the dplyr package from R using Python/Pandas (as a learning exercise). Something I'm stuck on is the "piping" functionality.

In R/dplyr, this is done using the pipe-operator %>%, where x %>% f(y) is equivalent to f(x, y). If possible, I would like to replicate this using infix syntax (see here).

To illustrate, consider the two functions below.

import pandas as pd

def select(df, *args):
    cols = [x for x in args]
    df = df[cols]
    return df

def rename(df, **kwargs):
    for name, value in kwargs.items():
        df = df.rename(columns={'%s' % name: '%s' % value})
    return df

The first function takes a dataframe and returns only the given columns. The second takes a dataframe, and renames the given columns. For example:

d = {'one' : [1., 2., 3., 4., 4.],
     'two' : [4., 3., 2., 1., 3.]}

df = pd.DataFrame(d)

# Keep only the 'one' column.
df = select(df, 'one')

# Rename the 'one' column to 'new_one'.
df = rename(df, one = 'new_one')

To achieve the same using pipe/infix syntax, the code would be:

df = df | select('one') \
        | rename(one = 'new_one')

So the output from the left-hand side of | gets passed as the first argument to the function on the right-hand side. Whenever I see something like this done (here, for example) it involves lambda functions. Is it possible to pipe a pandas DataFrame between functions in the same manner?

I know Pandas has the .pipe method, but what's important to me is the syntax of the example I provided. Any help would be appreciated.

MrFlick
Malthus

7 Answers

34

It is hard to implement this with the bitwise OR operator (|) because pandas.DataFrame already implements it. If you don't mind replacing | with >>, you can try this:

import pandas as pd

def select(df, *args):
    cols = [x for x in args]
    return df[cols]


def rename(df, **kwargs):
    for name, value in kwargs.items():
        df = df.rename(columns={'%s' % name: '%s' % value})
    return df


class SinkInto(object):
    def __init__(self, function, *args, **kwargs):
        self.args = args
        self.kwargs = kwargs
        self.function = function

    def __rrshift__(self, other):
        return self.function(other, *self.args, **self.kwargs)

    def __repr__(self):
        return "<SinkInto {} args={} kwargs={}>".format(
            self.function, 
            self.args, 
            self.kwargs
        )

df = pd.DataFrame({'one' : [1., 2., 3., 4., 4.],
                   'two' : [4., 3., 2., 1., 3.]})

Then you can do:

>>> df
   one  two
0    1    4
1    2    3
2    3    2
3    4    1
4    4    3

>>> df = df >> SinkInto(select, 'one') \
            >> SinkInto(rename, one='new_one')
>>> df
   new_one
0        1
1        2
2        3
3        4
4        4

In Python 3 you can abuse unicode:

>>> print('\u01c1')
ǁ
>>> ǁ = SinkInto
>>> df >> ǁ(select, 'one') >> ǁ(rename, one='new_one')
   new_one
0        1
1        2
2        3
3        4
4        4
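
Why >> works here comes down to how Python dispatches binary operators: for `a >> b` it tries the left operand's `__rshift__` first and only falls back to the right operand's `__rrshift__` when the left one does not define the operator or returns `NotImplemented`. A minimal sketch with stand-in classes (`Stage`, `Plain` and `Greedy` are hypothetical names, not pandas types):

```python
class Stage:
    """Right-hand operand that wants to receive the left operand."""

    def __init__(self, function):
        self.function = function

    def __rrshift__(self, other):
        # Reached only if the left operand did not handle `>>` itself.
        return self.function(other)

    def __ror__(self, other):
        # Reached only if the left operand did not handle `|` itself.
        return self.function(other)


class Plain:
    """Left operand that defines neither `|` nor `>>`."""

    def __init__(self, value):
        self.value = value


class Greedy(Plain):
    """Left operand that claims `|` for itself, whatever is on the right."""

    def __or__(self, other):
        return "handled by Greedy.__or__"


double = Stage(lambda obj: obj.value * 2)

piped_shift = Plain(21) >> double    # falls back to Stage.__rrshift__
piped_or = Plain(21) | double        # falls back to Stage.__ror__
blocked_or = Greedy(21) | double     # Greedy.__or__ wins; Stage is never asked
greedy_shift = Greedy(21) >> double  # `>>` is still free, so Stage handles it
```

How a DataFrame treats an unfamiliar right-hand operand of | is a pandas implementation detail that this sketch does not verify; it only shows why an operator the left operand leaves undefined is the safer choice.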

[update]

Thanks for your response. Would it be possible to make a separate class (like SinkInto) for each function to avoid having to pass the functions as an argument?

How about a decorator?

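def pipe(original):
    class PipeInto(object):
        # Kept in a dict so the function is not turned into a bound method.
        data = {'function': original}

        def __init__(self, *args, **kwargs):
            # Store the arguments per instance; writing them into the shared
            # class-level dict would let a later call overwrite an earlier one.
            self.args = args
            self.kwargs = kwargs

        def __rrshift__(self, other):
            return self.data['function'](
                other,
                *self.args,
                **self.kwargs
            )

    return PipeInto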


@pipe
def select(df, *args):
    cols = [x for x in args]
    return df[cols]


@pipe
def rename(df, **kwargs):
    for name, value in kwargs.items():
        df = df.rename(columns={'%s' % name: '%s' % value})
    return df

Now you can decorate any function that takes a DataFrame as the first argument:

>>> df >> select('one') >> rename(one='first')
   first
0      1
1      2
2      3
3      4
4      4

Python is awesome!

I know that languages like Ruby are "so expressive" that they encourage people to write every program as a new DSL, but this is kind of frowned upon in Python. Many Pythonistas consider overloading an operator for a different purpose sinful blasphemy.

[update]

User OHLÁLÁ is not impressed:

The problem with this solution is when you are trying to call the function instead of piping. – OHLÁLÁ

You can implement the dunder-call method:

def __call__(self, df):
    return df >> self

And then:

>>> select('one')(df)
   one
0  1.0
1  2.0
2  3.0
3  4.0
4  4.0

Looks like it is not easy to please OHLÁLÁ:

In that case you need to call the object explicitly:
select('one')(df) Is there a way to avoid that? – OHLÁLÁ

Well, I can think of a solution, but there is a caveat: your original function must not take a second positional argument that is a pandas dataframe (keyword arguments are OK). Let's add a __new__ method to our PipeInto class inside the decorator that tests whether the first argument is a dataframe; if it is, we just call the original function with the arguments:

def __new__(cls, *args, **kwargs):
    if args and isinstance(args[0], pd.DataFrame):
        return cls.data['function'](*args, **kwargs)
    return super().__new__(cls)

It seems to work but probably there is some downside I was unable to spot.

>>> select(df, 'one')
   one
0  1.0
1  2.0
2  3.0
3  4.0
4  4.0

>>> df >> select('one')
   one
0  1.0
1  2.0
2  3.0
3  4.0
4  4.0
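
Putting the three pieces together (`__rrshift__`, `__call__` and `__new__`), the whole decorator might look like the sketch below. To keep it runnable without pandas it takes the data type as a parameter and uses plain lists as a stand-in for DataFrames; `pipe_for`, `take` and `scale` are hypothetical names, and the arguments are stored per instance rather than in a shared dict.

```python
def pipe_for(data_type):
    """Build a decorator whose stages accept `data_type` as piped input.

    Hypothetical helper: the answer above hard-codes pd.DataFrame instead.
    """
    def decorator(original):
        class PipeInto:
            def __new__(cls, *args, **kwargs):
                # Direct call such as take(data, 2): run the function now.
                if args and isinstance(args[0], data_type):
                    return original(*args, **kwargs)
                return super().__new__(cls)

            def __init__(self, *args, **kwargs):
                # Deferred call such as take(2): remember the arguments.
                self.args = args
                self.kwargs = kwargs

            def __rrshift__(self, other):
                # data >> take(2)
                return original(other, *self.args, **self.kwargs)

            def __call__(self, data):
                # take(2)(data)
                return data >> self

        PipeInto.__name__ = original.__name__
        return PipeInto
    return decorator


@pipe_for(list)
def take(items, n):
    return items[:n]


@pipe_for(list)
def scale(items, factor=1):
    return [item * factor for item in items]


data = [1, 2, 3, 4]
direct = take(data, 2)                       # called like a normal function
piped = data >> take(3) >> scale(factor=10)  # piped through two stages
called = scale(factor=2)(data)               # stage built first, applied later
```

The caveat from the answer still applies: a deferred call whose first positional argument happens to be of `data_type` is mistaken for a direct call.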
Paulo Scardine
  • Thanks for your response. Would it be possible to make a separate class (like SinkInto) for each function to avoid having to pass the functions as an argument? – Malthus Nov 12 '15 at 21:54
  • Awesome! That looks perfect, but unfortunately I'm getting an error. Here is my code: [link](http://pastebin.com/Qss4HrEK). Not sure what I'm missing. – Malthus Nov 13 '15 at 21:55
  • Sorry, I swear I tested it before posting, but now I was able to reproduce the same error you got. I've updated the answer with a working version of the decorator. – Paulo Scardine Nov 16 '15 at 11:13
  • The problem with this solution is when you are trying to call the function instead of piping. – Mokus Nov 23 '17 at 07:32
  • @OHLÁLÁ it should be easy enough to fix by adding a `__call__` method. – Paulo Scardine Nov 23 '17 at 14:04
  • @PauloScardine In that case you need to call the object explicitly: pipe= select(df, 'one') pipe() Is there a way to avoid that? – Mokus Nov 24 '17 at 08:31
  • There are some options. Not sure if a proper solution deserves its own question or if I should just amend this answer. – Paulo Scardine Nov 24 '17 at 12:49
  • I added a new question here: https://stackoverflow.com/questions/47474704/how-can-i-create-a-chain-pipeline – Mokus Nov 24 '17 at 13:51
12

While I can't help mentioning that using dplyr in Python might be the closest thing to having dplyr in Python (it has the rshift operator, but as a gimmick), I'd also like to point out that the pipe operator might only be necessary in R because of its use of generic functions rather than methods as object attributes. Method chaining gives you essentially the same thing without having to override operators:

dataf = (DataFrame(mtcars).
         filter('gear>=3').
         mutate(powertoweight='hp*36/wt').
         group_by('gear').
         summarize(mean_ptw='mean(powertoweight)'))

Note that wrapping the chain in a pair of parentheses lets you break it across multiple lines without the need for a trailing \ on each line.
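
For the data in the question, the same idea in plain pandas could be sketched with `DataFrame.pipe`, which the asker already mentions. `select` and `rename` below are simplified restatements of the question's functions, not part of pandas:

```python
import pandas as pd


def select(df, *cols):
    # Keep only the named columns.
    return df[list(cols)]


def rename(df, **kwargs):
    # Keyword names are the old column names, values the new ones.
    return df.rename(columns=kwargs)


df = pd.DataFrame({'one': [1., 2., 3., 4., 4.],
                   'two': [4., 3., 2., 1., 3.]})

# Parentheses allow one method call per line without trailing backslashes.
result = (
    df
    .pipe(select, 'one')
    .pipe(rename, one='new_one')
)
```

This keeps the left-to-right reading order of a pipe without overloading any operator, at the cost of the `.pipe(...)` wrapper around each step.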

lgautier
  • 3
    This is nice but unfortunately, I have to do work with base Python functions/objects, and they cannot work like this. That's why I'm looking for a proper piping system. – CoderGuy123 Apr 30 '17 at 18:44
  • 1
    @Deleet Please take a look at https://github.com/sspipe/sspipe. It works with any python object. Please upvote my answer below if it satisfies your requirement. – mhsekhavat Jan 22 '19 at 18:51
8

You can use the sspipe library with the following syntax:

from sspipe import p
df = df | p(select, 'one') \
        | p(rename, one = 'new_one')
mhsekhavat
  • Does sspipe work with general python or only Pandas DFs? – alancalvitti Oct 31 '19 at 18:24
  • @alancalvitti It does support general python. – mhsekhavat Nov 10 '19 at 01:25
  • Is the 'p' necessary? A true pipe operator would allow this syntax: `df | select('one') | rename(one='new_one')`, at least where `select`, `rename` are curried. – alancalvitti Nov 11 '19 at 12:47
  • you can define `select2 = lambda x: p(select, x)` then `df | select2('one')` works. – mhsekhavat Nov 13 '19 at 13:10
  • no, because you'd have to redefine every function in python as you've shown. The entire point of pipelining is to compose existing functions. – alancalvitti Nov 13 '19 at 14:42
  • 1
    That would be impossible, because Python needs a clue to distinguish `pipe` semantics from [bitwise-OR](https://wiki.python.org/moin/BitwiseOperators) semantics. When you just write `x | y`, python doesn't know which one to use. – mhsekhavat Nov 15 '19 at 11:53
  • Just because `|` is used in Unix shell pipes is irrelevant- use a different symbol. Other languages can pipe and use different syntax, eg in R it's `%>%`, in Wolfram Language `/*`. – alancalvitti Nov 15 '19 at 14:56
6

I would argue strongly against doing this, or anything else suggested in the answers here, and would instead just implement a pipe function in standard Python code, without operator trickery, decorators or whatnot:

def pipe(first, *args):
  for fn in args:
    first = fn(first)
  return first

See my answer here for more background: https://stackoverflow.com/a/60621554/2768350

Overloading operators, pulling in external libraries and whatnot only serves to make the code less readable, less maintainable, less testable and less pythonic. If I want to do some kind of pipe in Python, I would not want to do more than pipe(input, fn1, fn2, fn3). That's the most readable and robust solution I can think of. If someone in our company committed operator overloading or new dependencies to production just to do a pipe, it would get immediately reverted and they would be sentenced to doing QA checks for the rest of the week :D If you really, really, really must use some sort of operator for a pipe, then maybe you have bigger problems and Python is not the right language for your use case...
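
The pipe(input, fn1, fn2, fn3) form expects one-argument functions, so steps that need extra arguments have to be bound first. A minimal sketch, assuming a dict of lists as a stand-in for a DataFrame (`select` and `rename` here are illustrative re-implementations, not the question's pandas versions): keyword arguments can be bound with `functools.partial`, while positional arguments that follow the data need a lambda, because `partial` binds from the left.

```python
from functools import partial


def pipe(first, *fns):
    # Feed the running value through each function in turn.
    for fn in fns:
        first = fn(first)
    return first


def select(table, *cols):
    # Keep only the named columns of a dict-of-lists table.
    return {col: table[col] for col in cols}


def rename(table, **kwargs):
    # Keyword names are the old column names, values the new ones.
    return {kwargs.get(col, col): values for col, values in table.items()}


table = {'one': [1., 2., 3.], 'two': [4., 3., 2.]}

result = pipe(
    table,
    lambda t: select(t, 'one'),      # positional extra argument: use a lambda
    partial(rename, one='new_one'),  # keyword extra argument: partial is enough
)
```

Writing `partial(select, 'one')` instead would bind 'one' to the `table` parameter, which is the reason for the lambda in the first step.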

jramm
  • the assignment statement in your for loop is a cardinal sin of functional programming – alancalvitti Aug 19 '21 at 14:56
  • 1
    @alancalvitti a) Readability always wins over strawman 'rules' in my book b) To bring functional utilities to non-functional languages, I am afraid that you will *have* to use non-functional concepts at some point. Just take a look at libraries like ramdajs. That said, it is quite clear from my solution that a recursive alternative is possible: `def pipe(first, *args): return pipe(args[0](first), *args[1:]) if args else first` purists might argue that conditionals are not functional, but since python is not functional, it does not support pattern matching or alternatives – jramm Aug 20 '21 at 07:57
  • 1
    - i use a function that's similar to your above pipe, but as an operator so it's a lambda that's applied to the data, that way they can be recursively pipelined: def right_compose(*fn): return lambda x: functools.reduce(lambda f,g: g(f), list(fn),x) - no assignment necessary – alancalvitti Aug 20 '21 at 19:04
  • re conditionals, these can also be represented functionally, eg a select or cases statement that takes a lambda for pattern matching and another for rewriting. The problem is not so much functional vs not, as symbolic vs not. In Mathematica, one can match sub-expressions of a symbolic expression tree. This is not easy to do in python -but maybe the Google pyglove offers some insight – alancalvitti Aug 20 '21 at 19:08
1

I have been porting data packages (dplyr, tidyr, tibble, etc.) from R to Python:

https://github.com/pwwang/datar

If you are familiar with those packages in R and want to use them in Python, it is here for you:

from datar.all import *

d = {'one' : [1., 2., 3., 4., 4.],
     'two' : [4., 3., 2., 1., 3.]}
df = tibble(one=d['one'], two=d['two'])

df = df >> select(f.one) >> rename(new_one=f.one)
print(df)

Output:

   new_one
0      1.0
1      2.0
2      3.0
3      4.0
4      4.0
Panwen Wang
0

I couldn't find a built-in way of doing this, so I created a class that uses the __call__ operator because it supports *args/**kwargs:

class Pipe:
    def __init__(self, value):
        """
        Creates a new pipe with a given value.
        """
        self.value = value
    def __call__(self, func, *args, **kwargs):
        """
        Creates a new pipe with the value returned from `func` called with
        `args` and `kwargs`.
        """
        value = func(self.value, *args, **kwargs)
        return Pipe(value)

The syntax takes some getting used to, but it allows for piping.

def get(dictionary, key):
    assert isinstance(dictionary, dict)
    assert isinstance(key, str)
    return dictionary.get(key)

def keys(dictionary):
    assert isinstance(dictionary, dict)
    return dictionary.keys()

def filter_by(iterable, check):
    assert hasattr(iterable, '__iter__')
    assert callable(check)
    return [item for item in iterable if check(item)]

def update(dictionary, **kwargs):
    assert isinstance(dictionary, dict)
    dictionary.update(kwargs)
    return dictionary


x = Pipe({'a': 3, 'b': 4})(update, a=5, c=7, d=8, e=1)
y = (x
    (keys)
    (filter_by, lambda key: key in ('a', 'c', 'e', 'g'))
    (set)
    ).value
z = x(lambda dictionary: dictionary['a']).value

assert x.value == {'a': 5, 'b': 4, 'c': 7, 'd': 8, 'e': 1}
assert y == {'a', 'c', 'e'}
assert z == 5
Brett Beatty
0

An old question but still of interest to me (coming from R). So despite the objection of purists here is a shorty inspired by http://tomerfiliba.com/blog/Infix-Operators/

class FuncPipe:
    class Arg:
        def __init__(self, arg):
            self.arg = arg
        def __or__(self, func):
            return func(self.arg)

    def __ror__(self, arg):
        return self.Arg(arg)
pipe = FuncPipe()

Then

1 |pipe| \
  (lambda x: x+1) |pipe| \
  (lambda x: 2*x)

returns

4 
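
The infix form works because | is left-associative, so `1 |pipe| f` parses as `(1 | pipe) | f`. A sketch of the evaluation, with the class restated so the block runs on its own (`add_one` and `double` are illustrative names):

```python
class FuncPipe:
    class Arg:
        def __init__(self, arg):
            self.arg = arg

        def __or__(self, func):
            # Second half: `Arg | func` applies the function.
            return func(self.arg)

    def __ror__(self, arg):
        # First half: `value | pipe` wraps the value.
        return self.Arg(arg)


pipe = FuncPipe()


def add_one(x):
    return x + 1


def double(x):
    return 2 * x


wrapped = 1 | pipe            # int.__or__ gives up, so FuncPipe.__ror__ runs
step_one = wrapped | add_one  # Arg.__or__ calls add_one(1)
chained = 1 |pipe| add_one |pipe| double
```

This relies on the left operand not handling | itself; a value whose own `__or__` accepts arbitrary objects would never reach `FuncPipe.__ror__`, which is the limitation the first answer raises for DataFrames.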
user3763801