2

I have a data frame where there are some duplicated x values. Those values are associated with a range. The values that are duplicated (duplicated x values) and that have an overlapping range (both lines are overlapping in low and up column) should be deleted, but I want to keep the values that they refer to (the val colum).

library(dplyr)
df = data.frame(x=c("A","A","A","B","B","B","C"),
           low = c(-10,-5,100,100,200,300,10),
           up = c(2,3,200,150,250,350,20), 
           val = c(1,2,150,125,225,325,15))
df
  x low  up val
1 A -10   2   1
2 A  -5   3   2
3 A 100 200 150
4 B 100 150 125
5 B 200 250 225
6 B 300 350 325
7 C  10  20  15

As you can see, line 1 and 2 are overlapping. Looking at the example here. I tried to write a dplyr code, but it's not giving me the expected result. The difference with the example is that I'm not "merging" 2 datasets. So I don't know how to test different ranges within a dataset that contains duplicated values. I was thinking to group the duplicated values in the data set and test them individually. But since it's not integrated in the dplyr function, I think it's not working.

df.gr = df %>% 
  group_by(x)

df.gr[with(df.gr, low <= up),]

In the end, I want something like this

  x low  up val
1 A -10   3 1,2
2 A 100 200 150
3 B 100 150 125
4 B 200 250 225
5 B 300 350 325
6 C  10  20  15

As you can see, for the range that are overlapping, I'm keeping the lowest value, and the highest of the 2 ranges. In addition, I want to "remember" the values in the column "val". That's why I want to end up with 1,2 in the range that is overlapping.

Also, in the example I'm citing here, there was only 2 ranges for each duplicated values. In my case I could have 2 duplicated values or more. I want to test any duplicated value and test their range and see if they overlap.

For example,

df = data.frame(x=c("A","A","A","A","B","B","B","C"),
           low = c(-10,-5,-2,100,100,200,300,10),
           up = c(2,3,4,200,150,250,350,20), 
           val = c(1,2,3,150,125,225,325,15))

df
  x low  up val
1 A -10   2   1
2 A  -5   3   2
3 A  -2   4   3
4 A 100 200 150
5 B 100 150 125
6 B 200 250 225
7 B 300 350 325
8 C  10  20  15

Would give:

  x low  up   val
1 A -10   4 1,2,3
2 A 100 200   150
3 B 100 150   125
4 B 200 250   225
5 B 300 350   325
6 C  10  20    15

I've also tried to use the spread function in tidy without luck:

tidyr::spread(df,x,val)
  low  up   A   B  C
1 -10   2   1  NA NA
2  -5   3   2  NA NA
3  -2   4   3  NA NA
4  10  20  NA  NA 15
5 100 150  NA 125 NA
6 100 200 150  NA NA
7 200 250  NA 225 NA
8 300 350  NA 325 NA

Basically, if I want to use this I would have to put as key both the low and up columns, but I can't do that. Also, the number of overlap for each duplicated values would make the size of the datagram variable when it's creating the different columns. So I don't know how to proceed with this...

M. Beausoleil
  • 3,141
  • 6
  • 29
  • 61

2 Answers2

1

Edit: see edit history for complicated recursive solution

Here is a simple solution where we create a boolean which identifies a change of group, cumsum on this boolean give us a group identifier, we use group_by on this identifier and then summarize the values.

library(dplyr)
# Example 1
df = data.frame(x=c("A","A","A","B","B","B","C"),
                low = c(-10,-5,100,100,200,300,10),
                up = c(2,3,200,150,250,350,20), 
                val = c(1,2,150,125,225,325,15))

df %>% arrange(x,low) %>%
  group_by(x,set = cumsum(c(TRUE,x[-1] != x[-n()] | low[-1] > up[-n()]))) %>%
  summarize(low=min(low),up=max(up),val=lst(val)) %>%
  print.data.frame

#   x set low  up  val
# 1 A   1 -10   3 1, 2
# 2 A   2 100 200  150
# 3 B   3 100 150  125
# 4 B   4 200 250  225
# 5 B   5 300 350  325
# 6 C   6  10  20   15

# Example 2
df = data.frame(x=c("A","A","A","A","B","B","B","C"),
                low = c(-10,-5,-2,100,100,200,300,10),
                up = c(2,3,4,200,150,250,350,20), 
                val = c(1,2,3,150,125,225,325,15))

df %>% arrange(x,low) %>%
  group_by(x,set = cumsum(c(TRUE,x[-1] != x[-n()] | low[-1] > up[-n()]))) %>%
  summarize(low=min(low),up=max(up),val=lst(val)) %>%
  print.data.frame

#   x set low  up     val
# 1 A   1 -10   4 1, 2, 3
# 2 A   2 100 200     150
# 3 B   3 100 150     125
# 4 B   4 200 250     225
# 5 B   5 300 350     325
# 6 C   6  10  20      15
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
0

This could work. I believe you need to use "collapse" within the paste command.

output <- df %>% group_by(x) %>% summarise(low = min(low), up = max(up), values = paste(val,collapse=","))

This outputs this df:

structure(list(x = structure(1:3, .Label = c("A", "B", "C"), class = "factor"), 
    low = c(-10, 100, 10), up = c(200, 350, 20), values = c("1,2,3,150", 
    "125,225,325", "15")), class = c("tbl_df", "tbl", "data.frame"
), .Names = c("x", "low", "up", "values"), row.names = c(NA, 
-3L))

# A tibble: 3 x 4
       x   low    up      values
  <fctr> <dbl> <dbl>       <chr>
1      A   -10   200   1,2,3,150
2      B   100   350 125,225,325
3      C    10    20          15
M. Beausoleil
  • 3,141
  • 6
  • 29
  • 61
leeum
  • 264
  • 1
  • 13
  • 2
    But I don't want to summarize all of the duplicated values, I just want to remove the lines from duplicated values that their *range* are *overlapping*. Your code is actually removing all duplicated lines regardless if they overlap or not... – M. Beausoleil Oct 30 '17 at 21:57