
I have a data frame with 10,017 observations across 159 financial institutions. How can I improve the normality of each institution's distribution without going to Excel and manually replacing values beyond ±3 SD with the values at the 1st and 99th percentiles of the distribution?

I'm new to data analysis, so I hope this is clear.

So far I computed the cutoffs with `tapply(df$x, df$id, quantile, probs = c(0.01, 0.99))` and then changed the outliers in Excel.
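The Excel step can be done directly in R. Here is a minimal base-R sketch of per-group winsorizing, using the column names `x` and `id` from the question on made-up toy data:

```r
# Toy data standing in for the real data frame; the columns x and id
# follow the names used above, but the values are invented
set.seed(1)
df <- data.frame(
  id = rep(c("A", "B"), each = 50),
  x  = c(rnorm(50), rnorm(50, mean = 5))
)

# Winsorize within each institution: clamp values below the 1st
# percentile and above the 99th percentile to those percentiles
df$x_wins <- ave(df$x, df$id, FUN = function(v) {
  q <- quantile(v, probs = c(0.01, 0.99), na.rm = TRUE)
  pmin(pmax(v, q[1]), q[2])
})
```

`ave()` applies the function separately within each `id` group and returns the results in the original row order, so no merging step is needed.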

Vinícius Félix
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. We don't need your actual data, just something to test with. – MrFlick Dec 19 '22 at 21:45

1 Answer


Here is an example that might help you:

library(dplyr)


mtcars %>% 
  # Select just two variables for the example
  select(vs, drat) %>%
  # Group by the vs variable
  group_by(vs) %>% 
  mutate(
    # Compute the 1% and 99% quantiles of drat within each vs group
    q_01 = quantile(drat, 0.01),
    q_99 = quantile(drat, 0.99),
    # Set drat to NA when it is more extreme than either quantile
    drat = if_else(drat < q_01 | drat > q_99, NA_real_, drat)
  )


# A function that sets values to NA when they are more extreme than
# their 1% / 99% quantiles (na.rm = TRUE guards against existing NAs)
remove_quantile <- function(x){
  if_else(x < quantile(x, 0.01, na.rm = TRUE) | x > quantile(x, 0.99, na.rm = TRUE),
          NA_real_, x)
}


mtcars %>% 
  group_by(vs) %>% 
  #Applying the function across all numeric variables from the data set
  mutate(across(.cols = where(is.numeric),.fns = remove_quantile))
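If you would rather replace the extreme values with the percentile cutoffs themselves (winsorizing, as described in the question) instead of setting them to NA, the same `group_by()` + `across()` pattern works with `pmin()`/`pmax()`. `winsorize_quantile` is a made-up helper name, not part of any package:

```r
library(dplyr)

# Hypothetical helper: clamp x to its own 1% / 99% quantiles instead
# of replacing extreme values with NA
winsorize_quantile <- function(x) {
  q <- quantile(x, probs = c(0.01, 0.99), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

mtcars %>% 
  group_by(vs) %>% 
  # Clamp every numeric column within each vs group
  mutate(across(.cols = where(is.numeric), .fns = winsorize_quantile))
```

This keeps every row (no NAs are introduced), which is usually what you want if the downstream analysis cannot handle missing values.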
Vinícius Félix