I have a dataset that shows the number of times each user visited each page during a year.
For example:
0: the user never visited the page
27: the user visited the page 27 times during the year

I want to cluster the users based on their page visits. The problem is that more than half of the values in the variables are zeros, and when I plot them with a box plot, the values greater than 20 look like outliers. But I think they are not outliers; they are real data, because a user visiting a page 27 times in a year is perfectly normal.
In this scenario, how should I deal with outliers?

Thanks in advance

boxplot screenshot here

  • I do not think a boxplot is the right solution for outlier detection in such zero-inflated data; please have a look at this thread: https://stats.stackexchange.com/questions/466324/how-to-identify-outliers-in-a-zero-inflated-binomial-distribution-of-count-data – L Smeets Dec 08 '20 at 08:22
  • It doesn't seem like you are asking a specific programming question. If you are seeking help analyzing or visualizing your data, then you should ask such questions over at [stats.se] instead. Otherwise describe exactly what you want to happen and provide some [example data](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that can be used to test possible solutions. – MrFlick Dec 08 '20 at 08:40

1 Answer

You can bin the counts: 0-10 visits a year, 10-20 visits a year, and so on. Most of the time, this presents the apparent outliers in an effective way.
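As a rough sketch of the binning idea, here is one way it could look in plain Python. The function name `bin_visits`, the bin width of 10, and the sample counts are all made up for illustration. Two choices here are assumptions rather than part of the suggestion above: zero gets its own bin (since more than half of the values are zeros), and the remaining bins are made non-overlapping (1-10, 11-20, ...).

```python
from collections import Counter


def bin_visits(count, width=10):
    """Map a yearly visit count to a labelled bin, keeping zero separate."""
    if count < 0:
        raise ValueError("visit counts cannot be negative")
    if count == 0:
        return "0"
    # Lower edge of the bin: 1, 11, 21, ... for width=10.
    lower = ((count - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


# Hypothetical visit counts for one page, one value per user.
visits = [0, 0, 0, 3, 0, 27, 12, 0, 1, 20, 0, 45]

binned = [bin_visits(v) for v in visits]
bin_counts = Counter(binned)
print(bin_counts)
```

With bins like these, a count of 27 simply falls into the "21-30" group instead of showing up as an isolated point far above the box. Whether equal-width bins suit the clustering step depends on the data, so treat the width as something to tune.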

Fokke