I am learning R and have a dataset with the variables "annoyTruck_transf", "annoyCar_transf" and "dBA", plus three other variables that are not relevant to this question. These variables are interval scaled. I plotted them as boxplots, where the y-axis shows the annoyance level for car or truck sounds and the x-axis shows the volume in dBA (see the picture). Both boxplots show some outliers. I have already run the analysis with the outliers included, and now I want to run it without them. How do I remove them from the dataset? I have googled the problem, but I do not understand the solutions, especially when people use different variable names. I would be glad for any help.

EDIT: This is one of my attempts at removing the outliers. I am not sure it is correct: most outliers were removed, but four new outliers appeared for cars, while for trucks there are none left. I also do not know how many outliers were removed and how many values remain. How can I check that?

list_quantiles <- tapply(d2_nocars$annoyCar_transf, d2_nocars$dBA,
                         quantile)

Q1s <- sapply(1:17, function(i) list_quantiles[[i]][2])
Q3s <- sapply(1:17, function(i) list_quantiles[[i]][4])

IQRs <- tapply(d2_nocars$annoyCar_transf, d2_nocars$dBA, IQR)

Lowers <- Q1s - 1.5*IQRs
Uppers <- Q3s + 1.5*IQRs

datas <- split(d2_nocars, d2_nocars$dBA)

data_no_outlier <- NULL
for (i in 1:17) {
  # keep rows strictly inside the 1.5 * IQR fences of this dBA level
  out <- subset(datas[[i]], datas[[i]]$annoyCar_transf > Lowers[i]
                & datas[[i]]$annoyCar_transf < Uppers[i])
  data_no_outlier <- rbind(data_no_outlier, out)
}

# Now we exclude the outliers from annoyTruck_transf

list_quantiles2 <- tapply(data_no_outlier$annoyTruck_transf,
                          data_no_outlier$dBA, quantile)

Q1s2 <- sapply(1:17, function(i) list_quantiles2[[i]][2])
Q3s2 <- sapply(1:17, function(i) list_quantiles2[[i]][4])

IQRs2 <- tapply(data_no_outlier$annoyTruck_transf,
                data_no_outlier$dBA, IQR)

Lowers2 <- Q1s2 - 1.5*IQRs2
Uppers2 <- Q3s2 + 1.5*IQRs2

datas2 <- split(data_no_outlier, data_no_outlier$dBA)

data_no_outlier2 <- NULL
for (i in 1:17) {
  # keep rows strictly inside the 1.5 * IQR fences of this dBA level
  out2 <- subset(datas2[[i]], datas2[[i]]$annoyTruck_transf > Lowers2[i]
                 & datas2[[i]]$annoyTruck_transf < Uppers2[i])
  data_no_outlier2 <- rbind(data_no_outlier2, out2)
}
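
For the "how many were removed and how many are left" part, one option is to build a logical keep/drop vector and compare row counts before and after. The following is only a minimal sketch: `d` is a small made-up data frame standing in for `d2_nocars` (the values are invented), and `inside_fences()` is a hypothetical helper, not something from a package. It applies the same 1.5 * IQR rule within each dBA level via `ave()`, and assumes there are no missing values in the column.

```r
# Hypothetical toy data standing in for d2_nocars; the values are invented.
set.seed(1)
d <- data.frame(
  dBA = rep(c(40, 50, 60), each = 20),
  annoyCar_transf = c(rnorm(19, 3), 15, rnorm(19, 5), -9, rnorm(20, 7))
)

# Hypothetical helper: TRUE for values strictly inside the
# 1.5 * IQR fences of the vector it is given
inside_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x > q[1] - fence & x < q[2] + fence
}

# ave() runs the helper separately within each dBA level and returns
# the results in the original row order (as 0/1, hence as.logical)
keep <- as.logical(ave(d$annoyCar_transf, d$dBA, FUN = inside_fences))
d_clean <- d[keep, ]

nrow(d)                   # rows before
nrow(d_clean)             # rows left
nrow(d) - nrow(d_clean)   # rows removed in total
table(d$dBA[!keep])       # rows removed per dBA level
table(d_clean$dBA)        # rows left per dBA level
```

The same `nrow()` and `table()` calls work on `d2_nocars`, `data_no_outlier` and `data_no_outlier2` from the code above.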

Boxplots: [image: boxplots of annoyCar_transf and annoyTruck_transf by dBA]

Phazon0
  • Note that after you remove those outliers, you may wind up with new "outliers" if you re-plot the data, since you've changed the distribution. There's no clear definition of what exactly an "outlier" is. If you want to use the definition of something outside 1.5 times the inter-quartile range, then note that when you remove points, you change the quartile range. It all depends on your assumptions about your data and how it was collected. If you want advice about dealing with outliers, you might ask for help at [stats.se] instead. – MrFlick Dec 19 '22 at 19:30
  • Thank you. I was already thinking about the possibility of new outliers emerging. I added one of my codes where this case happened. – Phazon0 Dec 19 '22 at 20:59
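
The effect described in the comments can be reproduced on a small vector. This is a sketch with invented values for a single dBA level, and `drop_outliers()` is a hypothetical helper that applies the 1.5 * IQR rule once, using `quantile()` with its default settings as in the question's code.

```r
# Hypothetical values for a single dBA level, invented for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 14, 30)

# Hypothetical helper: one pass of the 1.5 * IQR rule,
# keeping only values strictly inside the fences
drop_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x[x > q[1] - fence & x < q[2] + fence]
}

pass1 <- drop_outliers(x)      # upper fence is 14.5, so only 30 is removed
pass2 <- drop_outliers(pass1)  # fence has moved to 13, so 14 is now flagged
pass3 <- drop_outliers(pass2)  # nothing left to remove

c(length(x), length(pass1), length(pass2), length(pass3))
# 10 9 8 8
```

So 14 is not an outlier in the original data, but becomes one as soon as 30 is gone. Whether to stop after one pass or repeat until nothing changes is a judgment call about the data, not something the rule itself decides.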
