R Tidyverse Alternative Code in Python for data wrangling

Question

I am trying to do the following in Python:

Import the iris dataset.
Create a new column named target_measure: the natural log of Sepal.Width multiplied by the square of Petal.Width.
Create a new variable called categorical_measure that classifies the previous column into 3 labels: if target_measure < 1.5 the label is "<1.5"; if target_measure >= 1.5 and target_measure < 3.5 it is "1.5-3.5"; anything else is "out of target".
Calculate the mean sepal width and petal width grouped by species, as well as the count of each label in the column categorical_measure.
Finally, keep only the rows where the "out of target" count is greater than or equal to 5.

We can download/import the iris dataset here:

import pandas as pd
data=pd.read_csv("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv")

My R code goes as follows

library(tidyverse)


data=iris # R's built-in fun to import iris


#desired output

data %>% # %>% is R's pipe: it executes the lines below, feeding them the data object
  group_by(Species) %>%  #groups by species
  mutate(target_measure=log(Sepal.Width)*(Petal.Width)^2)%>% #creates column target_measure
  mutate(categorical_measure=case_when(target_measure<1.5~"<1.5",  #creates column categorical_measure based on criteria
                                       target_measure>=1.5 & target_measure<3.5~"1.5-3.5",
                                       TRUE~"out of target")) %>% 
  summarise(mean_of_sepal=mean(Sepal.Width), #calculates mean of sepal.width of grouped data
            mean_of_petal=mean(Petal.Width),
            'No of 1.5'=sum(categorical_measure=="<1.5"), #calculates count label="<1.5" from column categorical_measure
            'No of 1.5-3.5'=sum(categorical_measure=="1.5-3.5"),#calculates count label="1.5-3.5" 
            'No of out of target'=sum(categorical_measure=="out of target")) %>%  #calculates count label="out of target"
  filter(`No of out of target`>=5) # filters desired output

Code without comments (for faster reading):

data %>% 
  group_by(Species) %>% 
  mutate(target_measure=log(Sepal.Width)*(Petal.Width)^2)%>% 
  mutate(categorical_measure=case_when(target_measure<1.5~"<1.5",
                                       target_measure>=1.5 & target_measure<3.5~"1.5-3.5",
                                       TRUE~"out of target")) %>% 
  summarise(mean_of_sepal=mean(Sepal.Width),
            mean_of_petal=mean(Petal.Width),
            'No of 1.5'=sum(categorical_measure=="<1.5"),
            'No of 1.5-3.5'=sum(categorical_measure=="1.5-3.5"),
            'No of out of target'=sum(categorical_measure=="out of target")) %>% 
  filter(`No of out of target`>=5)

My desired output is:

# A tibble: 1 x 6
  Species   mean_of_sepal mean_of_petal `No of 1.5` `No of 1.5-3.5` `No of out of target`
  <fct>             <dbl>         <dbl>       <int>           <int>                 <int>
1 virginica          2.97          2.03           0              11                    39

Is there a way to achieve this level of simplicity in Python?

So far I have come across the pandas library and useful functions such as data.groupby(['species']). However, in every tutorial or YouTube video I find, each step is done separately, or a function is created first and then used with .apply. I am looking for a solution that uses pipes or some similar structure.

Does this answer your question? [Functional pipes in python like %>% from R's magrittr](https://stackoverflow.com/questions/28252585/functional-pipes-in-python-like-from-rs-magrittr) — user438383, Dec 11 '22 at 22:33
You may want to use [pyjanitor](https://pyjanitor-devs.github.io/pyjanitor/). — PaulS, Dec 11 '22 at 22:38

score 2 · Accepted Answer · answered Dec 11 '22 at 23:42


pyjanitor may be easier for you, but here is a pandas-based solution that uses method chaining (similar to tidyverse piping):

import numpy as np
import pandas as pd

df = pd.read_csv("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv")

(df.assign(target_measure=lambda x: np.log(x["sepal_width"]) * (x["petal_width"] ** 2))
 .assign(categorical_measure=lambda x: x["target_measure"].map(lambda y: "<1.5" if y < 1.5 else "1.5-3.5" if y < 3.5 else "out of target"))
 .groupby("species")
 .agg({'sepal_width': [("mean_of_sepal", 'mean')], "petal_width": [('mean_of_petal', 'mean')],
       'categorical_measure': [('No of 1.5', lambda x: sum(x == "<1.5")),
                               ('No of 1.5-3.5', lambda x: sum(x == "1.5-3.5")),
                               ('No of out of target', lambda x: sum(x == "out of target"))]}).droplevel(0, axis=1)
 .loc[lambda x: x['No of out of target'] >= 5]
 .reset_index())

Output:

     species  mean_of_sepal  mean_of_petal  No of 1.5  No of 1.5-3.5  \
0  virginica          2.974          2.026          0             11   

   No of out of target  
0                   39
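
As a hedged variant of the chain above, the sketch below swaps the nested conditional for `np.select` (which, like `case_when`, takes the first matching condition) and uses pandas named aggregation so no `droplevel` is needed. The sample rows, the helper columns `is_low`/`is_mid`/`is_out`, the output names `n_low`/`n_mid`/`n_out`, and the threshold of 2 are all assumptions made for illustration; they are not part of the original answer or the real iris data.

```python
import numpy as np
import pandas as pd

# Made-up sample rows using the CSV's column names; not the real iris data.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
    "sepal_width": [3.5, 3.0, 3.0, 2.8, 3.2],
    "petal_width": [0.2, 0.2, 2.5, 1.5, 2.0],
})

summary = (
    df.assign(target_measure=lambda x: np.log(x["sepal_width"]) * x["petal_width"] ** 2)
    # np.select returns the choice for the first condition that is True
    .assign(categorical_measure=lambda x: np.select(
        [x["target_measure"] < 1.5, x["target_measure"] < 3.5],
        ["<1.5", "1.5-3.5"],
        default="out of target",
    ))
    # One boolean column per label, so each count becomes a plain sum
    .assign(
        is_low=lambda x: x["categorical_measure"] == "<1.5",
        is_mid=lambda x: x["categorical_measure"] == "1.5-3.5",
        is_out=lambda x: x["categorical_measure"] == "out of target",
    )
    .groupby("species")
    # Named aggregation: output_name=(input_column, function)
    .agg(
        mean_of_sepal=("sepal_width", "mean"),
        mean_of_petal=("petal_width", "mean"),
        n_low=("is_low", "sum"),
        n_mid=("is_mid", "sum"),
        n_out=("is_out", "sum"),
    )
    # The question filters on >= 5 for the full dataset; 2 suits this tiny sample
    .loc[lambda x: x["n_out"] >= 2]
    .reset_index()
)
print(summary)
```

On the full CSV, the same chain should work with the threshold set back to 5; the output column names here avoid spaces only so they can be written as keyword arguments.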

answered Dec 11 '22 at 23:42

PaulS



Hey @PaulS, thank you so much for this answer; it really offers a way to pipe similar to R. Thank you for your time and effort. I would like to ask one final question: could you please elaborate on the assign function, and also on why you use so many lambda functions in the .agg and assign portions? Everywhere I look there is a lambda function in Python and I can't wrap my head around it. Thank you so much! – R_Student Dec 11 '22 at 23:55

Welcome, @R_Student! You can see `assign` as `mutate` in `tidyverse`. Lambda functions correspond to the anonymous functions of R. In the `assign` statements, the lambda functions are needed to pass the dataframe produced in the previous step. If you have more questions, please ask: I will be happy to answer them. Came from R to Python too! – PaulS Dec 12 '22 at 00:03

Thank you so much. I read on your SO page that you are self-taught: first, that's a great achievement, and second, what resources would you recommend so that I can master Python? Any blogs? Books? Online sources? Thanks a million – R_Student Dec 12 '22 at 00:46

@R_Student, I think the best way to learn Python depends on the person. In my case, I learnt Python by trying to answer questions here on StackOverflow, not with the intent to publish my first answers, but to see how well my rudimentary answers compared with other people's answers to the same questions. In the beginning, one feels completely lost, but one *must* persist. – PaulS Dec 12 '22 at 01:02

Over time, one will reach a certain mastery. By feeling lost, we will be motivated to search the web for ways and functions to solve the questions -- the challenges will motivate you. Most importantly: Pay close attention to the answers by people with very high reputation, and study them. – PaulS Dec 12 '22 at 01:02
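
The point made in the comments above, that the lambdas in `assign` exist to receive the DataFrame produced by the previous step, can be seen in a small sketch. The frame `df`, the columns `a`, `b`, `c`, and the helper `add_one_to_b` are hypothetical names chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Each lambda is called with the DataFrame as it exists at that point in the
# chain, so the second step can use "b" even though df itself has no "b".
chained = (
    df.assign(b=lambda x: x["a"] * 2)
      .assign(c=lambda x: x["b"] + 1)
)

# A lambda is just an unnamed function; a named function works the same way.
def add_one_to_b(x):
    return x["b"] + 1

same = df.assign(b=lambda x: x["a"] * 2).assign(c=add_one_to_b)

print(chained)
```

Writing `.assign(c=df["b"] + 1)` instead would raise a `KeyError`, because `df` refers to the original frame, which never gained a `b` column; `assign` returns a new DataFrame rather than modifying `df`.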
