I would like to understand a line of code similar to this:

df.groupBy(someExpr).agg(somAgg).where(somePredicate) 

I don't understand how to chain methods like that in Python. I don't need to understand exactly what the line above does; I just want to know what this technique is called so that I can look into it. I tried to replicate something similar. I'm pretty sure it is not a good implementation, but I wrote it as an example of how I currently picture the code working under the hood:

class Example:

  def __init__(self, *args):
    self.list = [arg for arg in args]

  def groupBy(self):
    self.list = [value for value in self.list if isinstance(value, int)]
    return self

  def agg(self):
    self.list = sum(self.list)
    return self

  def where(self, elem):
    self.list =  [value for value in self.list if value == elem]
    return self 

df = Example("a",1,3,3,5,"C","D")
df.groupBy().where(3).agg().list

My question is: what is the best way to implement a method chain? What happens if each method returns a different type of value? And how can I get rid of .list, so that df.groupBy().where(3).agg().list becomes df.groupBy().where(3).agg()?

Eric Bellet
  • 2
    What is your question, _exactly_? Which part don’t you understand? – bfontaine Jun 18 '19 at 11:52
  • 1
    Seems like you have the right idea? This is just [method chaining](https://stackoverflow.com/questions/1103985/method-chaining-why-is-it-a-good-practice-or-not). See also [builder pattern](https://stackoverflow.com/questions/328496/when-would-you-use-the-builder-pattern) –  Jun 18 '19 at 11:53
  • Instead of replicating it, I'd recommend going straight to the source code and looking at how the actual function is implemented: https://github.com/python/ You can search within the repo for what you specifically want to look at. – Hugh_Kelley Jun 18 '19 at 11:53
  • There is no theory here: self.list is modified, then self is returned, and because of that you can build a chain. This is not the best solution, though; I would make it yield-based. – user8426627 Jun 18 '19 at 12:00

2 Answers

1

So this is simply a clever design of the package. Let's assume that df is a pandas DataFrame (the camelCase groupBy is actually the Spark spelling, pandas spells it groupby, but the idea is the same). Each method returns a new object instead of modifying df in place, so df itself is not necessarily modified. The call could therefore be translated as:

df_grouped = df.groupBy(someExpr)
df_g_aggregated = df_grouped.agg(somAgg)
df_g_a_filtered = df_g_aggregated.where(somePredicate) 

If you look at the definitions, these methods mostly return a DataFrame, so each consecutive call can rely on the same class. Strictly speaking, the group-by step returns an intermediate grouped object (DataFrameGroupBy in pandas, GroupedData in PySpark), and calling agg on it gives a DataFrame back, which is why where can follow. Changing the order of the calls would change the outcome, and a call only works if the object returned by the previous call provides that method.
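A chain also works when one step returns a different type, as long as that type offers the next method. Here is a minimal sketch of the idea; the classes `Table` and `Grouped` are hypothetical stand-ins and are not taken from pandas or Spark:

```python
from collections import defaultdict


class Table:
    """Hypothetical stand-in for a DataFrame: a list of row dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def groupBy(self, column):
        # Returns a different type that only knows how to aggregate.
        return Grouped(self.rows, column)

    def where(self, predicate):
        # Returns a new Table; the original is left untouched.
        return Table(row for row in self.rows if predicate(row))


class Grouped:
    """Hypothetical intermediate result of Table.groupBy."""

    def __init__(self, rows, column):
        self.column = column
        self.groups = defaultdict(list)
        for row in rows:
            self.groups[row[column]].append(row)

    def agg(self, target):
        # Sums `target` per group and goes back to a Table.
        return Table(
            {self.column: key, target: sum(r[target] for r in rows)}
            for key, rows in self.groups.items()
        )


df = Table([
    {"kind": "a", "n": 1},
    {"kind": "a", "n": 3},
    {"kind": "b", "n": 5},
])
result = df.groupBy("kind").agg("n").where(lambda row: row["n"] > 4)
print(result.rows)  # [{'kind': 'b', 'n': 5}]
```

In this sketch `df.groupBy("kind").where(...)` would raise an AttributeError, because `Grouped` has no `where` method; the order of the calls is constrained by what each step returns.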

So your code could look like:

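class Example:

  def __init__(self, *args):
    self.list = list(args)

  def groupBy(self, key=type):
    # pair each value with its group (by default, its type)
    groups = [key(value) for value in self.list]
    self.list = list(zip(self.list, groups))
    return self

  def agg(self):
    # sum the values of each group, leaving one (total, group) row per group
    totals = {}
    for value, group in self.list:
      totals[group] = totals[group] + value if group in totals else value
    self.list = [(total, group) for group, total in totals.items()]
    return self

  def where(self, key, elem):
    # keep the rows whose column `key` (0 = value, 1 = group) equals elem
    self.list = [row for row in self.list if row[key] == elem]
    return self

df = Example("a",1,3,3,5,"C","D")
df.groupBy().where(1, int).agg().list  # [(12, <class 'int'>)]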

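This is only a rough sketch, not a faithful reimplementation of those functions. The logic is that the simplest way to keep a chain working is for every method to return the same type, so that the next call finds the methods it expects. If a method returned the result of sum([…]) directly, you would get a single integer and the chain would end there. My example also modifies the object in place, which is redundant, but I hope you get the gist.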
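The in-place modification mentioned above can be avoided by returning a new instance from every method. A minimal sketch, assuming a simplified class; the name `Chain` and the method `only_ints` are hypothetical:

```python
class Chain:
    def __init__(self, *args):
        # A tuple makes it harder to change the values by accident.
        self.values = tuple(args)

    def only_ints(self):
        # Builds a new Chain instead of changing self.
        return Chain(*(v for v in self.values if isinstance(v, int)))

    def where(self, elem):
        # Builds a new Chain holding only the values equal to elem.
        return Chain(*(v for v in self.values if v == elem))


df = Chain("a", 1, 3, 3, 5, "C", "D")
threes = df.only_ints().where(3)
print(threes.values)  # (3, 3)
print(df.values)      # ('a', 1, 3, 3, 5, 'C', 'D')
```

Because `df` is never changed, the same starting object can feed several independent chains.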

Piotr Kamoda
  • So in this line of code, df.groupBy(someExpr).agg(somAgg).where(somePredicate), what does the method named "where" return? I want to avoid using .list in my line of code. Also, how can I make each method return a different type of value? – Eric Bellet Jun 18 '19 at 12:37
  • @EricBellet method `where( )` returns a DataFrame in which all rows meet a certain condition. I don't know why you would want those functions to return different types of values; that would make such chains impossible, unless you defined some interface with those functions as its abstract members, but then they would have to behave the same way, which makes implementing the interface redundant. – Piotr Kamoda Jun 18 '19 at 12:47
0

This is called method chaining. Notice that each method returns self, so the line of code you mention can be evaluated as follows:

df.groupBy().where(3).agg().list

Firstly, df.groupBy() returns df, having modified it, so this becomes:

df.where(3).agg().list

Similarly, df.where(3) returns df, having modified it, so this becomes:

df.agg().list

Finally, df.agg() returns df, also having modified it, so this becomes:

df.list

The end result is equivalent to writing:

df = Example("a",1,3,3,5,"C","D")
df.groupBy()
df.where(3)
df.agg()
df.list
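
To drop the trailing .list, one option is to let the last method of the chain return the plain value instead of self. A minimal sketch based on the class from the question; treating agg as the final step of every chain is an assumption about the desired design:

```python
class Example:
    def __init__(self, *args):
        self.list = list(args)

    def groupBy(self):
        # Intermediate step: keeps the ints and returns self so the chain continues.
        self.list = [v for v in self.list if isinstance(v, int)]
        return self

    def where(self, elem):
        # Intermediate step: keeps the values equal to elem and returns self.
        self.list = [v for v in self.list if v == elem]
        return self

    def agg(self):
        # Terminal step: returns an int, so nothing can be chained after it.
        return sum(self.list)


df = Example("a", 1, 3, 3, 5, "C", "D")
total = df.groupBy().where(3).agg()
print(total)  # 6
```

The trade-off is that agg must come last, since the int it returns has no groupBy or where method.
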
Sam Rice