
I want to remove outliers (defined as residuals more than 2 standard deviations from the mean) that show up in my residual plots.

What command should I use?

DF.mod.2 <- lm(X ~ A + B + C + D + F, data = DF)

I got the mean of residuals by this command:

mean(resid(DF.mod.2))

and standard deviation by this command:

sqrt(deviance(DF.mod.2)/df.residual(DF.mod.2))

How can I then exclude from my data frame the rows whose residuals are more than 2 standard deviations from the mean?

The residual plots:

[image of the residual plots not included]

Please help. I have been working on these data for a week and still do not know how to remove the outliers. I am allowed to remove at most 200 outliers.

  • Please edit your post with some example data. For tips on creating reproducible examples in R, see [this question](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – x4nd3r Oct 08 '14 at 00:12
  • if you had >200 points qualifying as outliers, how would you decide which ones to exclude? – Ben Bolker Oct 08 '14 at 00:22
  • I do not know! Do you have any recommendation? I do not even know how many outliers my data frame has! – Potential Scientist Oct 08 '14 at 00:39

3 Answers


There are a variety of ways you can do this. A simple strategy is to first save your residuals to the data.frame as a new column. Then you can add a second new column to flag if a residual is an outlier or not. You can then use that column to either make a new data.frame without outliers or subset your current data.frame or whatever else you need. Here is an example:

set.seed(20) #sets the random number seed.

# Test data and test linear model
DF<-data.frame(X=rnorm(200), Y=rnorm(200), Z=rnorm(200))
LM<-lm(X~Y+Z, data=DF)

# Store the residuals as a new column in DF
DF$Resid<-resid(LM)

# Find out what 2 standard deviations is and save it to SD2
SD2<-2*sd(resid(LM))
SD2
#[1] 1.934118

# Make DF$Outs 1 if a residual is more than 2 st. deviations from the mean, 0 otherwise
# (the mean residual is 0 for a linear model with an intercept, so abs() is enough)
DF$Outs<-ifelse(abs(DF$Resid)>SD2, 1, 0)

# Plot this, note that DF$Outs is used to set the color of the points.
plot(DF$Resid, col=DF$Outs+1, pch=16,ylim=c(-3,3))

Plot with outliers

#Make a new data.frame with no outliers
DF2<-DF[!DF$Outs,]
nrow(DF2)
#[1] 189    Was 200 before, 11 outliers removed

# Plot new data
plot(DF2$Resid, col=DF2$Outs+1,pch=16, ylim=c(-3,3))

Plot with outliers removed

That is the basic idea. You can combine some of these commands - you could just create the outliers column without saving SD2 for instance, and you don't really need two data.frames - you could just exclude the outliers rows when you need to.
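As a rough sketch of that shortcut, the flag can be computed straight from the model and used to subset on the fly. The names `keep` and `DF_clean` are made up for this illustration, and the data are simulated in the same way as in the example above:

```r
set.seed(20)

# Same kind of simulated data and model as in the example above
DF <- data.frame(X = rnorm(200), Y = rnorm(200), Z = rnorm(200))
LM <- lm(X ~ Y + Z, data = DF)

# Logical vector: TRUE for rows whose residual is within 2 SDs of zero
# (no SD2 variable and no Outs column are stored)
keep <- abs(resid(LM)) <= 2 * sd(resid(LM))

# Subset only where the cleaned data are needed
DF_clean <- DF[keep, ]

# Number of rows dropped
nrow(DF) - nrow(DF_clean)
```

This assumes the model was fitted on every row of `DF` (no rows dropped for missing values), so that the residuals line up with the rows of the data.frame.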

John Paul

I think you can filter using something like:

z[abs(z - mean(z)) < 2 * sd(z)]

where z is resid(DF.mod.2).
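The expression above returns the filtered residuals, not the filtered rows. A minimal sketch of carrying the same condition over to the data frame and then plotting the residuals of a refitted model might look like this. The simulated `DF` is only a stand-in for the asker's data, and `keep`, `DF.filtered` and `DF.mod.3` are hypothetical names:

```r
set.seed(1)

# Hypothetical stand-in for the asker's data (column names taken from the question)
n <- 500
DF <- data.frame(A = rnorm(n), B = rnorm(n), C = rnorm(n),
                 D = rnorm(n), F = rnorm(n))
DF$X <- 1 + DF$A - DF$B + rnorm(n)

DF.mod.2 <- lm(X ~ A + B + C + D + F, data = DF)

# Same condition as in the answer, kept as a logical index
z <- resid(DF.mod.2)
keep <- abs(z - mean(z)) < 2 * sd(z)

# Drop the flagged rows from the data frame, then refit the model
DF.filtered <- DF[keep, ]
DF.mod.3 <- lm(X ~ A + B + C + D + F, data = DF.filtered)

# Residuals vs. fitted values for the refitted model (base R plot method for lm)
plot(DF.mod.3, which = 1)
```

This relies on `resid()` returning one value per row of `DF`, which holds only if no rows were dropped for missing values when the model was fitted.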

agstudy
  • Thank you very much. How should I write this in RStudio? First I should write: z <- resid(DF.mod.2), then I should write z[abs(z-mean(z))<2*sd(z)]? If I want to plot residuals I should use this command: z <- lm(X ~ A+ B+ C+ D+ F, data=DF) and residualPlots(z)? Am I right? – Potential Scientist Oct 08 '14 at 00:25
  • I wrote those commands. Would you please tell me how to draw the residual plot of the filtered data frame? – Potential Scientist Oct 08 '14 at 01:11
  • with respect, @PotentialScientist, you seem a little out of your depth. You might need to get some local help or sit down and work through an introduction to R -- Stack Overflow is good for getting over occasional hurdles, but it can't teach you a new programming language ... – Ben Bolker Oct 08 '14 at 01:59
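# Read the data and fit the initial model
df = read.csv("SalesData.csv")

LinReg = lm(data = df, Sales ~ OrderCount)

# Standardize the residuals; as.vector() drops the one-column matrix that scale() returns
standardized_residuals = as.vector(scale(LinReg$residuals))
# Flag residuals more than 2 standard deviations from the mean (1 = outlier)
df_outlier = ifelse(abs(standardized_residuals) > 2, 1, 0)
df_inliers = df[!df_outlier, ]

# Refit the model without the flagged rows
LinReg = lm(data = df_inliers, Sales ~ OrderCount)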
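None of the answers handles the limit of 200 removals mentioned in the question. One possible way to respect it, sketched here under the assumption that the most extreme residuals should go first, is to rank the candidates by absolute standardized residual and drop at most `max_remove` of them. The simulated data stand in for `SalesData.csv`, and `max_remove`, `std_res`, `candidates` and `to_remove` are hypothetical names:

```r
set.seed(42)

# Hypothetical data standing in for SalesData.csv
df <- data.frame(OrderCount = rpois(1000, lambda = 20))
df$Sales <- 50 + 10 * df$OrderCount + rnorm(1000, sd = 25)

LinReg <- lm(Sales ~ OrderCount, data = df)

max_remove <- 200  # limit stated in the question
std_res <- as.vector(scale(resid(LinReg)))

# Candidate outliers: more than 2 SDs from the mean residual
candidates <- which(abs(std_res) > 2)

# If there are too many candidates, keep only the most extreme ones
ranked <- candidates[order(abs(std_res[candidates]), decreasing = TRUE)]
to_remove <- head(ranked, max_remove)

# Negative indexing with an empty vector would drop every row, so guard against it
df_inliers <- if (length(to_remove) > 0) df[-to_remove, ] else df
```

Whether "most extreme first" is the right rule depends on the analysis; it is only one way of answering the question raised in the comments about which points to exclude when more than 200 qualify.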