-1

I have following dataframe

d <- data.frame("Open" = rnorm(10,5,1) )

and if I insert outlier value

d$Open[4] = 100
d$Open[5] = 100

Now I want to replace these outlier value with normal values.

I tried replacing it with previous value but it doesn't work if there are outliers one after the other. or the very first element is an outlier.

d$Temp <- lag(d$Open ,1)
d$Temp <- ifelse(is.na(d$Temp), d$Open, d$Temp)

d$Open <- ifelse(
  ( ( d$Open - d$Temp)/ d$Temp) > 3,
  lag(d$Open, 1) ,
  d$Open
)

d <- d[,1]

Is there any better way to remove them?

Currently I have defined outlier value as any value which is 300% more than the previous value.

My definition of outlier is wrong that's why I am asking for any new method. I guess more than 100% than the median would be more appropriate for definition of outlier as mentioned in the comment. Thanks :)


My example was simple but if I use it with stock price then @Tarjae's method of identifying outlier doesn't work

dput(round(d$Open,0))
c(5, 5, 5, 5, 5, 5, 6, 6, 5, 5, 5, 6, 6, 5, 6, 7, 6, 6, 7, 6, 
6, 5, 6, 7, 8, 8, 9, 8, 7, 7, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 
5, 6, 6, 6, 5, 5, 6, 6, 7, 6, 6, 6, 6, 6, 7, 6, 6, 6, 6, 6, 6, 
5, 5, 5, 5, 5, 5, 5, 6, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8, 7, 7, 7, 
7, 7, 9, 8, 8, 9, 8, 8, 10, 11, 10, 10, 10, 9, 12, 10, 9, 10, 
9, 11, 11, 11, 11, 10, 10, 10, 8, 8, 9, 8, 8, 8, 8, 7, 7, 8, 
8, 8, 8, 9, 8, 6, 8, 8, 7, 7, 7, 8, 8, 9, 10, 10, 11, 11, 10, 
10, 11, 11, 11, 11, 13, 13, 14, 13, 13, 15, 16, 16, 16, 18, 17, 
17, 17, 17, 18, 16, 15, 14, 17, 16, 17, 16, 16, 17, 18, 23, 19, 
18, 18, 17, 17, 16, 17, 18, 17, 16, 16, 16, 16, 15, 16, 16, 18, 
22, 24, 24, 30, 31, 31, 31, 32, 40, 41, 39, 38, 36, 34, 33, 33, 
38, 38, 36, 34, 34, 36, 33, 35, 34, 33, 33, 34, 33, 33, 34, 33, 
38, 37, 38, 37, 40, 42, 40, 43, 45, 46, 40, 39, 41, 33, 30, 32, 
29, 32, 30, 30, 29, 32, 31, 34, 35, 34, 34, 33, 32, 31, 32, 33, 
31, 30, 31, 36, 36, 34, 35, 36, 33, 32, 33, 32, 35, 35, 40, 40, 
42, 38, 39, 39, 36, 37, 38, 36, 38, 39, 40, 40, 40, 41, 45, 49, 
50, 50, 49, 48, 50, 49, 50, 57, 58, 54, 54, 54, 49, 49, 58, 54, 
50, 52, 70, 75, 77, 78, 82, 98, 102, 121, 129, 126, 125, 124, 
118, 138, 147, 126, 107, 97, 100, 89, 103, 93, 93, 74, 72, 68, 
82, 82, 81, 95, 98, 94, 96, 95, 94, 90, 85, 91, 76, 72, 72, 78, 
81, 83, 85, 84, 80, 80, 79, 78, 65, 63, 46, 55, 41, 36, 32, 40, 
33, 24, 21, 19, 21, 30, 27, 28, 30, 25, 24, 21, 24, 25, 24, 22, 
22, 20, 23, 25, 27, 30, 34, 34, 37, 35, 41, 49, 58, 55, 51, 54, 
52, 51, 50, 41, 44, 51, 54, 48, 52, 54, 54, 57, 77, 70, 69, 68, 
68, 74, 70, 62, 74, 73, 74, 73, 72, 76, 73, 79, 82, 88, 91, 90, 
82, 81, 82, 86, 88, 94, 91, 95, 97, 103, 102, 94, 102, 93, 87, 
79, 79, 80, 77, 80, 83, 86, 81, 82, 80, 84, 85, 83, 83, 84, 85, 
92, 88, 90, 88, 91, 93, 90, 91, 87, 84, 89, 81, 75, 76, 70, 68, 
71, 74, 68, 68, 64, 54, 58, 59, 58, 52, 52, 52, 54, 61, 59, 57, 
64, 63, 64, 60, 58, 54, 55, 55, 56, 51, 54, 56, 52, 51, 49, 41, 
39, 38, 37, 37, 35, 38, 36, 34, 32, 32, 31, 33, 32, 32, 28, 25, 
25, 24, 22, 23, 23, 23, 26, 31, 31, 32, 32, 37, 36, 37, 36, 36, 
34, 32, 32, 33, 31, 30, 30, 27, 24, 25, 25, 25, 24, 25, 30, 30, 
30, 31, 29, 27, 27, 27, 27, 24, 24, 22, 24, 25, 26, 28, 30, 30, 
29, 28, 30, 31, 31, 33, 34, 34, 34, 33, 35, 32, 30, 30, 31, 29, 
27, 27, 26, 26, 23, 21, 24, 25, 26, 25, 24, 27, 25, 25, 24, 24, 
24, 23, 22, 24, 24, 25, 24, 23, 22, 22, 20, 21, 21, 22, 22, 22, 
21, 23, 24, 24, 25, 26, 26, 28, 26, 27, 27, 27, 30, 30, 33, 36, 
33, 31, 30, 32, 31, 33, 30, 30, 29, 34, 35, 37, 40, 37, 37, 37, 
37, 41, 42, 41, 44, 41, 39, 38, 43, 38, 40, 39, 42, 44, 43, 39, 
41, 40, 38, 34, 24, 24, 23, 25, 27, 28, 27, 32, 28, 27, 29, 26, 
27, 26, 29, 28, 28, 29, 31, 28, 28, 31, 29, 28, 25, 26, 23, 23, 
24, 23, 20, 21, 20, 21, 20, 18, 17, 18, 21, 19, 19, 19, 21, 19, 
19, 19, 17, 16, 15, 14, 16, 15, 14, 15, 17, 17, 18, 16, 15, 15, 
14, 13, 14, 12, 12, 12, 12, 11, 9, 8, 9, 9, 8, 8, 7, 10, 11, 
12, 11, 11, 12, 14, 14, 15, 14, 14, 12, 13, 12, 12, 14, 15, 17, 
16, 16, 16, 17, 17, 15, 15, 13, 14, 13, 13, 13, 13, 15, 14, 15, 
17, 16, 14, 12, 15, 16, 15, 16, 15, 15, 18, 21, 20, 19, 19, 18, 
19, 18, 19, 18, 18, 18, 18, 21, 20, 22, 20, 20, 20, 19, 18, 18, 
19, 20, 18, 19, 19, 20, 21, 23, 22, 20, 20, 19, 20, 22, 23, 25, 
25, 28, 26, 27, 26, 25, 25, 24, 23, 22, 23, 21, 23, 24, 26, 30, 
28, 27, 20, 23, 21, 24, 22, 19, 19, 18, 20, 25, 24, 25, 23, 24, 
23, 23, 24, 22, 32, 29, 28, 27, 27, 24, 25, 28, 28, 29, 28, 28, 
30, 34, 33, 32, 29, 26, 29, 27, 27, 41, 43, 42, 40, 41, 34, 36, 
37, 36, 36, 33, 34, 32, 32, 30, 29, 33, 34, 36, 34, 37, 41, 40, 
36, 33, 33, 32, 32, 33, 33, 32, 31, 30, 27, 30, 30, 29, 26, 31, 
26, 26, 23, 23, 27, 27, 29, 26, 27, 27, 26, 28, 29, 31, 33, 31, 
31, 29, 27, 28, 28, 27, 27, 28, 27, 26, 25, 26, 24, 25, 24, 20, 
17, 14, 16, 15, 15, 15, 14, 15, 14, 14, 14, 15, 16, 15, 21, 20, 
19, 19, 18, 18, 18, 22, 26, 26, 28, 27, 28, 24, 25, 24, 22, 22, 
22, 21, 24, 26, 27, 26, 26, 27, 27, 34, 37, 34, 32, 30, 32, 34, 
30, *2900*, 31, 36, 32, 34, 33, 37, 37, 37, 42, 57, 52, 54, 52, 
51)

In the above data there is only one outlier which is 2900

Stupid_Intern
  • 3,382
  • 8
  • 37
  • 74
  • 2
    Why is the outlier depending on the previous value, when you generate the values using `rnorm`? Why you don't use something like "X% more than the median, or mean, value"? Then you can spot all outlier values, then the largest non-outlier value and then replace all outliers with that largest non-outlier value. – AntoniosK May 29 '21 at 08:53
  • The definition given for outlier is confusing? Why compare it with previous value only? If at all, you are comparing with previous value, there would be no circumstances leading to first value as outlier! Moreover, how to proceed for third value test for outlier, if in case your second value is an outlier in comparison with first value. Instead of creating a random case, it is suggested that you may define your logic on some complex case such as `c(1, 300, 3000, 2, 2000)` . can you please tell what your desired output be in this case? – AnilGoyal May 29 '21 at 09:03
  • You can have a look at options here. - https://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset/4788102#4788102 – Ronak Shah May 29 '21 at 09:32
  • In your revised data, I have 2 further questions. 1) What if your values follow a decreasing trend. If all values are positive there can be no value which is 300% lower than its previous value? 2)value followed by `2900` is `31` which is more than significantly lower than `2900` be considered an outlier? – AnilGoyal May 29 '21 at 10:10
  • You should never simply replace "outliers". Doing so invalidates your whole data analysis. – Roland May 29 '21 at 10:31
  • @Roland these values were a mistake from yahoo finance api. In that case also I shouldn't remove them? It will definitely make sense to remove the outliers in this case and replace them with nearest normal value. – Stupid_Intern May 29 '21 at 10:58
  • Removing is fine then. Replacing most likely not. Look into proper imputation techniques. I suspect you are dealing with time series. – Roland May 29 '21 at 11:20

1 Answers1

3
# Your dataframe
df <- data.frame("Open" = rnorm(10,5,1) )

# Adding outliers
df$Open[4] = 100
df$Open[5] = 100

# Visualize outliers
boxplot(d)

# create a vector of outliers for the numeric factor
outliers <- boxplot(df$Open, plot = FALSE)$out

# Replace the outliers with NA (or whatever you want)
df[df$Open %in% outliers, "Open"] = NA
df

Output:

       Open
1  4.589664
2  4.621286
3  7.317407
4        NA
5        NA
6  3.490202
7  3.536626
8  2.825471
9  5.710270
10 5.541880
TarJae
  • 72,363
  • 6
  • 19
  • 66