
I am trying to find a solution to the following problem:

How many points per group lie on a straight line?

I could not find any solution for this problem in R.

Below is some sample data, along with a plot to show what it looks like:

data <- structure(list(Group = c(22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 22782L, 
22782L, 11553L, 11553L, 11553L, 11553L, 11553L, 7059L, 7059L, 
7059L, 7059L, 22782L), x = c(100L, 150L, 250L, 287L, 312L, 387L, 
475L, 550L, 837L, 937L, 987L, 1087L, 1175L, 1300L, 1325L, 1487L, 
1662L, 1700L, 1725L, 1812L, 1912L, 2412L, 3012L, 3562L, 4162L, 
4762L, 5362L, 5750L, 5712L, 6225L, 6825L, 6887L, 7237L, 7850L, 
7800L, 7937L, 7975L, 8275L, 8362L, 8662L, 8725L, 8950L, 9100L, 
9312L, 9400L, 9600L, 4637L, 900L, 4187L, 5800L, 7075L, 1125L, 
3400L, 3562L, 3462L, 5412L), y = c(493L, 482L, 479L, 476L, 481L, 
479L, 474L, 480L, 480L, 491L, 489L, 490L, 485L, 485L, 485L, 479L, 
482L, 482L, 482L, 482L, 484L, 489L, 491L, 489L, 496L, 498L, 500L, 
0L, 498L, 500L, 502L, 506L, 497L, 0L, 495L, 506L, 497L, 494L, 
498L, 500L, 496L, 499L, 496L, 495L, 495L, 498L, 825L, 284L, 850L, 
360L, 790L, 861L, 883L, 882L, 881L, 502L)), row.names = c(23L, 
24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 
37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 
51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 
64L, 65L, 66L, 67L, 68L, 69L, 281L, 312L, 313L, 315L, 316L, 377L, 
378L, 380L, 511L, 815L), class = "data.frame")

The data consist of a group column (three groups in this case) and x and y coordinates:

 Group   x   y
22782 100 493
22782 150 482
22782 250 479
22782 287 476
22782 312 481

Below is a plot of group 22782:

[Image: scatter plot of y against x for group 22782]

As you can see, many points lie almost exactly on the same line, and I would like to find out how many points per group meet this condition.

The expected output would look like this:

  Group Max Points  
  22782  20

I would appreciate any help or tips! Thanks

Mal_a

3 Answers


Let's assume that you know only a minority of points are not on the line. You also mention that you only want to consider horizontal lines.

In that case, you can use the median as a robust estimate of the horizontal line position. You could use the mean, but it may be swayed by extreme values that are not on the line anyway.
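To see why the median is the safer choice, consider a handful of y values where one point sits far off the line. This is a minimal base R sketch with made-up values, not the full sample data:

```r
# Hypothetical y values: four points near one line plus a single outlier at 0
y <- c(480, 482, 485, 490, 0)

median(y)  # 482: still sits among the points on the line
mean(y)    # 387.4: dragged towards the outlier
```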

The code is self-explanatory:

library(dplyr)

tolerance <- 10

data %>%
  group_by(Group) %>%
  # the median of y is the estimated position of the horizontal line
  mutate(y_line = median(y), 
         on_line = abs(y - y_line) <= tolerance) %>%
  count(Group, on_line)

Result:

#   Group on_line     n
#   <int> <lgl>   <int>
# 1  7059 FALSE       1
# 2  7059 TRUE        3
# 3 11553 FALSE       4
# 4 11553 TRUE        1
# 5 22782 FALSE      13
# 6 22782 TRUE       34

You can of course pipe that into filter(on_line) to keep only the count of points that are on the line.
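As a sketch of that last step, the counts can be reduced to one row per group, in the shape the question asks for. The toy data frame and the column name `max_points` are assumptions for illustration; with the real data you would start from `data` instead.

```r
library(dplyr)

# Toy data (hypothetical values): group 1 has 3 of 4 points near y = 500,
# group 2 has 2 of 3 points near y = 880
toy <- data.frame(
  Group = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  y     = c(498, 500, 503, 0, 879, 881, 300)
)

tolerance <- 10

result <- toy %>%
  group_by(Group) %>%
  mutate(on_line = abs(y - median(y)) <= tolerance) %>%
  filter(on_line) %>%                 # keep only the points on the line
  count(Group, name = "max_points")   # one row per group

result
```

Note that a group with no point within the tolerance would drop out of the result entirely after `filter()`. If such groups should appear with a count of 0, `summarise(max_points = sum(on_line))` in place of the last two steps is one way to keep them.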

asachet
  • This approach is quite inaccurate. For example, the median for Group 22782 is 491 although the line is actually at 500. Finding the true values of the lines only takes `with(data_group_i, labeling::extended(range(y)[1], range(y)[2], m = 5))`, so I don't see a reason not to do so. –  Jun 06 '19 at 11:41
  • What makes you say the line is at 500? If you're just "looking at the plot", I am afraid that's not correct. As far as I am aware, the lines do not have to be on whole multiples of 100. – asachet Jun 06 '19 at 11:43
  • The line is at 500 for this group because this is how ggplot sets breaks by default. See my answer. –  Jun 06 '19 at 11:44
  • Who cares what the background lines are if you plot them? I think you may have misinterpreted OP's question. – asachet Jun 06 '19 at 11:45
  • Which horizontal lines are you talking about, then, if not the lines of the plot? –  Jun 06 '19 at 11:47
  • Any line defined by `y = constant` is horizontal. We'll just let OP clarify whether they meant the lines of the background legend plot or just any horizontal lines. – asachet Jun 06 '19 at 11:49

Because we do not know what values the lines in ggplot have, we need to find out what breaks are set by default. This is answered here and used in my code.

The following function counts how many points are on the lines per group. You can also set a tolerance value for how much deviation from a line you accept. Furthermore, points may sometimes lie on different lines, as is the case for ggplot(subset(data, Group == 22782), aes(x=x,y=y)) + geom_point(), where points lie on two different lines (0 and 500).

plot

For this case you can decide whether you want the sum of all points that are on any line, or only the largest number of points gathered around a single line (here, how many points are at 500). You can choose this with any_or_max_line.
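The difference between the two options can be sketched in base R on a small example. The y values and the two line positions (0 and 500) below are made up for illustration, and the name `per_line` is hypothetical:

```r
# Hypothetical y values and line positions, for illustration only
y           <- c(0, 0, 498, 500, 502, 506, 360)
line_values <- c(0, 500)
tolerance   <- 10

# Number of points within `tolerance` of each line
per_line <- sapply(line_values, function(l) sum(abs(y - l) <= tolerance))

per_line       # 2 4
max(per_line)  # 4: points on the most populated line
sum(per_line)  # 6: points on any line
```

One thing to keep in mind with the sum: if the tolerance is larger than half the spacing between two lines, a point that is close to both lines is counted once for each of them.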

The function

# requires the labeling package
points.on.lines <- function(data, tolerance, any_or_max_line){
  # runs the code below per group
  sapply(unique(data$Group), function(group_i){
    # chooses i-th group
    data_group_i <- subset(data, Group == group_i)
    # find on which y-values the lines are
    line_values <- 
      with(data_group_i,
           labeling::extended(range(y)[1], range(y)[2], m = 5))
    # find out per line how many points are on or around that line
    points_on_lines <- sapply(line_values, function(line_values_i){
      sum(data_group_i$y >= line_values_i - tolerance &
            data_group_i$y <= line_values_i + tolerance)})
    # decides whether to take into account the line with most points or all points on any line
    if(any_or_max_line == "max"){
      points_on_lines <- max(points_on_lines)
    } else {
      points_on_lines <- sum(points_on_lines)
    }
    # names results by group
    names(points_on_lines) <- paste0("Group_", group_i)
    return(points_on_lines)
  })
}

Example

points.on.lines(data= data, tolerance= 50,
                any_or_max_line= "max")
Group_22782 Group_11553  Group_7059 
     45           3           4 
  • That's an interesting approach; however, the line value changes for each group and I am not able to define every value for them (in the original dataset there are more than 100 groups). – Mal_a Jun 06 '19 at 09:08
  • @Mal_a As I wrote in the answer, to me it is unclear what the line value is. How can I know what the line values for your 100 groups are? If you set the plot to increase y values in steps of 10,000, no value will be on any line, for example. Maybe you should provide the code of your plot. –  Jun 06 '19 at 09:11
  • Well, I am not able to know what the line value is myself; the only possible way is to look at each and every scatter plot. The code of my plot is very simple: `ggplot(subset(data, Group == 22782), aes(x=x,y=y)) + geom_point()` – Mal_a Jun 06 '19 at 09:15
  • @Mal_a I completely changed the answer according to the comments. –  Jun 06 '19 at 10:04
  • Thanks for an interesting approach! It may already partially solve my problem. – Mal_a Jun 07 '19 at 06:09

To me this seems like an interval optimisation problem (or, more generally, clustering of one-dimensional data). Unless you have fixed breaks or lines, one way I can think of to solve such a problem is the Jenks natural breaks optimization, which is already implemented in R in the package BAMMtools.

You basically first fix the lines, and then see which points belong to which line (the closest line).

One parameter you have to set in the function getJenksBreaks is the number of lines (or rather clusters).
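The "closest line" step on its own can be sketched in base R. The line positions and y values below are hypothetical stand-ins rather than breaks computed by getJenksBreaks, and `line_id` simply holds the index of the nearest line for each point:

```r
# Hypothetical line positions and y values, for illustration only
line_values <- c(0, 360, 506, 883)
y           <- c(493, 0, 861, 360, 500)

# For each point, the index of the closest line
line_id <- sapply(y, function(v) which.min(abs(v - line_values)))
line_id  # 3 1 4 2 3

# Number of points assigned to each line
table(line_id)
```

Because every point is assigned to some line, a point far from all lines still counts towards its nearest one, so a tolerance check may be worth adding if outliers should be excluded.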

There might be other methods to cluster those points, but here's the Jenks approach:

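library(BAMMtools)
library(dplyr)
library(ggplot2)

mydata <- data  # assumed to be the sample data from the question

lines <- getJenksBreaks(mydata$y, 5)
lines
# [1]   0   0 360 506 883
mydata <- mydata %>% 
  rowwise() %>% 
  # assign each point to the closest line
  mutate(line_id = as.character(which.min(abs(y-unique(lines))))) 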

mydata %>% 
  group_by(Group, line_id) %>% 
  summarise(cnt =n()) %>% 
  group_by(Group) %>% 
  summarise(max_points = max(cnt))
# 
# # A tibble: 3 x 2
#   Group max_points
#   <int>      <dbl>
# 1  7059          4
# 2 11553          3
# 3 22782         45

mydata %>% 
  #filter(Group == 22782) %>% 
  ggplot(aes(x,y, color = line_id)) + 
  geom_point() +
  geom_hline(yintercept = lines, 
             color = 'red', 
             #alpha = 0.5, 
             linetype ='dashed', 
             size = 0.3) +
  facet_grid(.~Group)

[Image: points per group, coloured by line_id, with the Jenks breaks drawn as dashed red lines]

DS_UNI
  • This approach looks very interesting; however, I am a bit worried about setting the number of "clusters", as it may differ quite a lot between groups. – Mal_a Jun 07 '19 at 06:08