
I want to remove the outlier data points from each cluster after k-means clustering. This is how I currently do it in R:

1.) Plot the graph:

plot(sort(df[[1]]$var))
plot(sort(df[[2]]$var))

2.) From the graph, identify the outlier (in my case, extreme) data points, then reset the row names so they match the row positions:

rownames(df[[1]])<-1:nrow(df[[1]])
rownames(df[[2]])<-1:nrow(df[[2]])

3.) Open View(df[[1]]) and View(df[[2]]), sort var in descending order, note the row index numbers of the outlier data points, and remove those rows from df[[1]] and df[[2]]:

df[[1]]<-df[[1]][-c(200,320,216),]
df[[2]]<-df[[2]][-c(7000,1200,2320),]

df is a list with 3 elements; df[[1]] accesses the first element (cluster).

Is there an easier, more efficient way to achieve the same result?

Quest
  • Please consider providing an [MCVE](https://stackoverflow.com/help/minimal-reproducible-example). This may help: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Hack-R Apr 21 '20 at 19:35
  • Take a look at `?boxplot.stats`; it will identify statistical outliers in a vector. – Wil Apr 21 '20 at 19:39
  • I just want to remove the first n rows (I get n from the graph) of each list element, sorted in descending order of the variable ```var```. – Quest Apr 21 '20 at 20:17
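
The request in the last comment (drop the n largest values of ```var``` from each cluster) can be sketched without viewing or re-indexing the data by hand. This is only a minimal sketch under assumptions: `clusters` is made-up stand-in data for the list `df`, `n_drop` holds the per-cluster counts read off the plots, and `drop_top_n` is a hypothetical helper, not something from the original post.

```r
# Hypothetical stand-in for the list `df`: one data frame per cluster,
# each with a numeric column `var`
set.seed(1)
clusters <- list(
  data.frame(id = 1:10, var = c(rnorm(8), 50, 60)),
  data.frame(id = 1:12, var = c(rnorm(9), 40, 45, 70))
)

# Number of extreme rows to drop from each cluster (assumed to be read off the plots)
n_drop <- c(2, 3)

# Hypothetical helper: remove the n rows with the largest values of `var`
drop_top_n <- function(d, n) {
  if (n <= 0) return(d)  # guard: negative indexing with an empty vector would drop every row
  idx <- order(d$var, decreasing = TRUE)[seq_len(min(n, nrow(d)))]
  d[-idx, , drop = FALSE]
}

# Apply the helper to each cluster with its own n
trimmed <- Map(drop_top_n, clusters, n_drop)
sapply(trimmed, nrow)
```

`order()` finds the positions of the largest values directly, so the row names do not need to be reset and no row numbers have to be copied by hand.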

1 Answer

You need to include a short, reproducible example showing what you want and what you have tried. That said, the following may give you some hints if I'm guessing what you want correctly. Note that you can get min/max cut values from CIs or other means.

a <- 1:40
b <- a[a %in% 4:35] # Keep values from 4 to 35, i.e. treat values < 4 or > 35 as outliers
b
length(b) # Note there are no NAs using this approach

Basically, cut off the outliers at the relevant cutoff values and graph the remaining elements.
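
As one possible way to get the cut values "from other means", the sketch below uses the whisker ends reported by `boxplot.stats()` (suggested in a comment on the question) and applies the same cut to every cluster in a list. The data in `clusters` are invented, `keep_within_whiskers` is a hypothetical helper, and the default `coef = 1.5` is just the usual boxplot convention, so treat it as an illustration rather than the right rule for any particular analysis.

```r
# Hypothetical data: a list of clusters, each with a numeric column `var`
clusters <- list(
  data.frame(var = c(10, 11, 12, 12, 13, 13, 14, 15, 16, 95)),
  data.frame(var = c(-40, 20, 21, 22, 22, 23, 24, 25, 26, 27))
)

# Hypothetical helper: keep only rows whose `var` lies within the boxplot whiskers.
# Assumes `var` contains no NAs (the question states they were already removed).
keep_within_whiskers <- function(d, coef = 1.5) {
  # stats[1] and stats[5] are the lower and upper whisker ends
  whiskers <- boxplot.stats(d$var, coef = coef)$stats[c(1, 5)]
  d[d$var >= whiskers[1] & d$var <= whiskers[2], , drop = FALSE]
}

# Compute the cutoffs and filter separately for each cluster
trimmed <- lapply(clusters, keep_within_whiskers)
sapply(trimmed, nrow)
```

With these made-up values, the rule drops 95 from the first cluster and -40 from the second. Because the cutoffs are computed per cluster, the same call works for any number of clusters.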

John Garland
  • I have given the approach I am following. I want to know if there is an efficient way to do it. – Quest Apr 21 '20 at 19:39
  • Only after I plot the graph will I know my outliers. From there, say I identify 3 outlier points: I go to the df, sort in descending order of ```var```, and then remove the first 3 rows. – Quest Apr 21 '20 at 19:41
  • Outlier cutpoints must be calculated, not eyeballed. R has a number of ways of calculating them depending on your specific analysis. – John Garland Apr 21 '20 at 19:44
  • "Extreme" is a statistical concept which needs to be quantified in order to make any statistical sense. It can NOT be done in a slapdash, eyeball manner. – John Garland Apr 21 '20 at 19:48
  • So ```plot(sort(df[[1]]$var))``` plots ```sort(df[[1]]$var)``` against the index. It shows the points which are off the main trend (or far away from the other data points). These are what I want to remove. When I sort df in descending order, I can identify the data points shown by the graph and hence remove them. – Quest Apr 21 '20 at 19:53
  • What is the definition of "off the main trend"? You still have provided zero data and zero reproducible code. I'm done until you do, as there is nothing specific to talk about or to help you with. Once you have a decision rule, `%in%` or related operators will get you the reduced set. – John Garland Apr 21 '20 at 19:55
  • If you take dummy data in a df and plot the values using my code, you would probably understand. It plots a curve. – Quest Apr 21 '20 at 19:56
  • All the initial outlier treatment and NA removal has already been done. I am just trying to remove the extreme data points from each cluster. df is a list of 3 elements; df[[1]] accesses the first cluster in df. – Quest Apr 21 '20 at 19:57