Problem:
Say I have a numeric vector x
of observations of some distance (in R). For instance it could be the throwing length of a number of people.
x <- c(3,3,3,7,7,7,7,8,8,12,15,15,15,20,30)
h <- hist(x, breaks = 30, xlim = c(1,30))
I then want to define a set S
of "selectors" (ranges) that select as much of my observations as possible and at the same time span as little distance as possible (the cost is the sum of ranges in S
). Each selector si
range must be at least 3 (its resolution).
Example:
In the toy data x
I could put the first selector s1
from [6;8] which will select 4+2 observations (distance 7 and 8), use 3 distances and select 6/15 observations in total ([7;9] would give the same but for simplicity I put the selector midpoint in the max frequency). Next would be adding s2
[14;16] (6 distance and select 9/15). In summary, S
would be build along the steps:
- [6;8] (3, 6/15) #
s1
- [6;8], [14;16] (6, 9/15) #
s2
- [3;8], [14;16] (9, 12/15) #Extending
s1
(cheapest) - [3;8], [12;16] (11, 13/15) #Extending
s2
- [3;8], [12;16], [29;31], (14, 14/15) #
s3
- [3;8], [12;20], [29;31], (18, 15/15) #Extending
s2
One would stop the iterations when a certain total distance is used (sum of S
) or when a certain fraction of data is covered by S
. Or plot the sum of S
against fraction of data covered and decide from that.
For very huge data (100,000s clustered observations in 1,000,000s of distance space) I could probably be more sloppy by increasing the minimum steps allowed (above it is 1, maybe try 100) and decreasing the resolution (above its 3, one could try maybe 1000).
Since its equivalent of maximizing the area under density(x)
while minimizing the ranges of x, my intuition is that one could approximate the steps described (for time and memory considerations) using density()
and optim()
. Maybe its even a well known maximization/minimization problem.
Any suggestions that could get me started would be very appreciated.