Getting rid of certain columns in the loop

Question

outliers <- d_car_nb %>% 
  group_by(factor, segment) %>% 
  mutate(hinge_spread = 1.5*IQR(sold_fee), 
         lwr = quantile(sold_fee, .25) - hinge_spread, 
         upr = quantile(sold_fee, .75) + hinge_spread) %>%
  filter(sold_fee > upr | sold_fee < lwr)

outliers_sold_fee<-outliers %>%
  select(quotedate,factor,segment,sold_fee)
print(outliers_sold_fee)

I am not sure how to loop over this code so that it uses a different KPI each time instead of sold_fee, producing a new data frame with (quotedate, factor, segment, 'kpi') on each iteration.

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Dec 20 '21 at 19:44

score 0 · Answer 1 · answered Dec 22 '21 at 18:49

The key to the solution is to call dplyr's filter() and select() so that variable names are resolved using SE (standard evaluation, where variables are given as strings, e.g. "sold_fee" as in outliers[, "sold_fee"] in base R) rather than NSE (non-standard evaluation, where variables are given as unquoted names, e.g. sold_fee as in outliers$sold_fee in base R) (*).

NSE is the default in dplyr functions, so referring to a column whose name is stored in another variable (which is what your loop needs) is not straightforward.

From the documentation for filter() and select() we deduce that the way to use SE in each of them differs, as follows:

In filter() we should use the .data pronoun. In your example, it would be:

v = "sold_fee"
filter(.data[[v]] > upr | .data[[v]] < lwr)

In select() we should use the all_of() function. In your example, it would be:

v = "sold_fee"
select(quotedate, factor, segment, all_of(v))
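To see both forms working together, here is a minimal sketch on a small made-up data frame. The toy data and its values are assumptions for illustration only (lwr and upr are hard-coded here rather than computed); only the filter() and select() calls come from the snippets above.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the question, values are invented
toy <- tibble(
  quotedate = as.Date("2021-12-01") + 0:4,
  factor    = c("a", "a", "b", "b", "b"),
  segment   = c("x", "x", "y", "y", "y"),
  sold_fee  = c(10, 200, 12, 11, 13),
  lwr       = 5,   # fixed bounds, just to keep the example short
  upr       = 50
)

v <- "sold_fee"  # column name held as a string

result <- toy %>%
  filter(.data[[v]] > upr | .data[[v]] < lwr) %>%   # .data[[v]] looks up the column named by v
  select(quotedate, factor, segment, all_of(v))     # all_of(v) selects the column named by v

print(result)
```

Changing v to another column name is then enough to run the same pipeline on a different variable.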

That said, you can now adapt your code so that the column name is read from a character vector containing your analysis variables, and loop over it. Inside the loop, use the forms above for filter() and select(); the same .data[[v]] form also replaces sold_fee inside mutate(), where IQR() and quantile() are computed.

As a final note, you could store the data frame with the columns of interest for each variable in a list and print everything once the loop has finished, as in:

library(dplyr)

vars4analysis = c("sold_fee")  # List all the variables you want to analyze for outliers here
outliers_info = list()
for (v in vars4analysis) {
  outliers = ...                         # filter command here
  outliers_info[[v]] = outliers %>% ...  # select command here
}
print(outliers_info)  # This will show the info about the outliers for each analysis variable
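For reference, one way the skeleton might be filled in is sketched below. The d_car_nb stand-in and the second KPI name (net_fee) are invented for illustration, since the question did not include sample data. Note that sold_fee also appears inside mutate() in the original code, so this sketch uses .data[[v]] there as well.

```r
library(dplyr)

# Hypothetical stand-in for d_car_nb; net_fee is an invented second KPI
d_car_nb <- tibble(
  quotedate = as.Date("2021-01-01") + 0:39,
  factor    = rep(c("a", "b"), each = 20),
  segment   = rep(c("x", "y"), times = 20),
  sold_fee  = c(rep(98:102, length.out = 39), 500),   # 500 is a planted outlier
  net_fee   = c(-300, rep(48:52, length.out = 39))    # -300 is a planted outlier
)

vars4analysis <- c("sold_fee", "net_fee")
outliers_info <- list()

for (v in vars4analysis) {
  # Same pipeline as in the question, with sold_fee replaced by .data[[v]]
  outliers <- d_car_nb %>%
    group_by(factor, segment) %>%
    mutate(hinge_spread = 1.5 * IQR(.data[[v]]),
           lwr = quantile(.data[[v]], .25) - hinge_spread,
           upr = quantile(.data[[v]], .75) + hinge_spread) %>%
    filter(.data[[v]] > upr | .data[[v]] < lwr) %>%
    ungroup()

  # Keep the identifying columns plus the KPI currently being analysed
  outliers_info[[v]] <- outliers %>%
    select(quotedate, factor, segment, all_of(v))
}

print(outliers_info)
```

Each element of outliers_info is named after its KPI, so a single result can be pulled out afterwards with, for example, outliers_info[["sold_fee"]].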

(*) You can read more about non-standard evaluation here: http://adv-r.had.co.nz/Computing-on-the-language.html
