
I have a data frame (see below). How do I compress/delete rows based on a condition? Within each run of consecutive 1s in V1, I would like to keep only the row with the highest Volume.

E.g.: the V1 values of df[2:5,] all equal 1, and the highest Volume among them is df[4,], so delete df[c(2,3,5),]. For df[9:10,], the higher Volume of the two is df[10,], so delete df[9,], and so on. The result will look like the Example Picture, where V1 alternates 0101010101....

How can I achieve this without loops, in a vectorized way, so that it runs fast when dealing with millions of rows?

Update: I apply the same logic to V2, but delete only when the Volume is smaller. That's why row 13 got deleted in the Example Picture: the Volume of row 13 is less than the Volume of row 14.

Update (another question): I have tried akrun's method of choosing the maximum Volume of each group, but when there are two or more rows with the maximum Volume, a continuous 1 remains, as with row 9 in the output below. How do I remove the duplicate row? I could use duplicated(df$Volume); is there another way? Thanks.
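
For reference, one way to break such ties (a sketch applied on top of the group-maximum filter, assuming grp holds the run ids built in akrun's answer below): which.max() returns only the first maximum, so exactly one row per run survives.

# 1 for the first row attaining its run's maximum Volume, 0 otherwise
first_max <- ave(seq_along(df$Volume), grp,
                 FUN = function(i) i == i[which.max(df$Volume[i])])
df[first_max == 1, ]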

Update: Following up on akrun's attempt, the code also deletes rows where V1 == 0, which is not what I'm after. I would like to delete a row only when V1 == 1 and its Volume is smaller than the maximum Volume of its group. I know this will leave some continuous 0s in V1, and deleting the duplicated 0s in V1 will depend on the Weight, which is why I can only delete rows with V1 == 1 and cannot delete rows with V1 == 0. The output below (from a case where rows 9 and 10 share the same Volume) illustrates both issues; a sketch of the filter I'm after follows it.

#   Volume Weight V1 V2
#1  0.5367 0.5367  0  1
#4  1.1457 1.1413  1  0
#6  0.5694 0.5633  0  1
#7  1.2368 1.2343  1  0
#8  0.9662 0.9593  0  1
#9  1.4102 1.3923  1  0
#10 1.4102 1.3995  1  0
#11 1.1132 1.1069  0  1
#12 1.4535 1.3923  1  0
#14 1.1475 1.1447  0  1
#15 1.1790 1.1748  1  0
#18 1.1557 1.1552  0  1
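
Put differently (a minimal sketch of the intended filter, reusing the ave() call from akrun's comment below and assuming grp holds the run ids):

# keep a row unless it sits in a run of V1 == 1 and its Volume is below
# the run's maximum; rows with V1 == 0 are never deleted
df[!with(df, V1 == 1 & Volume < ave(Volume, grp, FUN = max)), ]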

Data frame df

    Volume Weight V1 V2 
 1: 0.5367 0.5367  0  1
 2: 0.8645 0.8508  1  0
 3: 0.8590 0.8585  1  0
 4: 1.1457 1.1413  1  0
 5: 0.8573 0.8568  1  0
 6: 0.5694 0.5633  0  1
 7: 1.2368 1.2343  1  0
 8: 0.9662 0.9593  0  1
 9: 1.3550 1.3412  1  0
10: 1.4102 1.3995  1  0
11: 1.1132 1.1069  0  1
12: 1.4535 1.3923  1  0
13: 1.0437 1.0344  0  1
14: 1.1475 1.1447  0  1
15: 1.1790 1.1748  1  0
16: 1.1749 1.1735  1  0
17: 1.1749 1.1731  1  0
18: 1.1557 1.1552  0  1
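
For reproducibility, the data frame can be constructed as follows (values copied from the printout above; V2 is the complement of V1 throughout):

df <- data.frame(
  Volume = c(0.5367, 0.8645, 0.8590, 1.1457, 0.8573, 0.5694, 1.2368, 0.9662,
             1.3550, 1.4102, 1.1132, 1.4535, 1.0437, 1.1475, 1.1790, 1.1749,
             1.1749, 1.1557),
  Weight = c(0.5367, 0.8508, 0.8585, 1.1413, 0.8568, 0.5633, 1.2343, 0.9593,
             1.3412, 1.3995, 1.1069, 1.3923, 1.0344, 1.1447, 1.1748, 1.1735,
             1.1731, 1.1552),
  V1 = c(0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0)
)
df$V2 <- 1 - df$V1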

Example Picture (desired result)

    Volume Weight V1 V2 
 1: 0.5367 0.5367  0  1
 4: 1.1457 1.1413  1  0
 6: 0.5694 0.5633  0  1
 7: 1.2368 1.2343  1  0
 8: 0.9662 0.9593  0  1
10: 1.4102 1.3995  1  0
11: 1.1132 1.1069  0  1
12: 1.4535 1.3923  1  0
14: 1.1475 1.1447  0  1
15: 1.1790 1.1748  1  0
18: 1.1557 1.1552  0  1
Jimmy
  • Related post: https://stackoverflow.com/questions/37809094/create-group-names-for-consecutive-values – zx8754 May 24 '17 at 10:21
  • According to the description you would not delete row 13, but according to the output picture you did? – Tonio Liebrand May 24 '17 at 10:51
  • I apply the same logic with `V2` but only delete when the Volume is smaller; that's why row 13 got deleted in the Picture, because the Volume of row 13 is less than the Volume of row 14. – Jimmy May 25 '17 at 00:41
  • @Jimmy I found your updated output. Why are you not deleting the 9th row? It has `V1` as 1. `df[!with(df, Volume < ave(Volume, grp, FUN = max) & V1 == 1),]` – akrun May 25 '17 at 07:51
  • @akrun I used `df[with(df, ave(Volume, grp, FUN = max)==Volume),]` and also `df[!with(df, Volume < ave(Volume, grp, FUN = max) & V1 == 1),]`, but the 9th row cannot be deleted because the Volumes of the 9th and 10th rows are the same. – Jimmy May 25 '17 at 08:12
  • @Jimmy That logic was not in your description. – akrun May 25 '17 at 08:14
  • @akrun I just found out about this problem, so I added an `Update (another question):` to my description. Anyway, I guess I can use `duplicated(df$Volume)` to remove the rows with the same Volume. – Jimmy May 25 '17 at 08:20
  • @Jimmy Yes, you can use `duplicated`. Also, in your input data, the 9th and 10th rows of Volume are not the same. – akrun May 25 '17 at 08:20

3 Answers


You could use library(data.table):

# per run of V1: keep the max-Volume row in runs of 1s, and all rows in runs of 0s
setDT(df)[, .SD[(Volume == max(Volume) & V1 == 1) | V1 == 0], by = rleid(V1)][]

Edit:

Concerning the problem of the extra grouping column in the result, the adapted trick from akrun (indexing rows with `.I`):

setDT(df)[df[, .I[(Volume == max(Volume) & V1 == 1) | V1 == 0], rleid(V1)]$V1][]
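
To see why the trailing `$V1` is needed (as discussed in the comments below), run the inner expression on its own. The selected row indices land in a column that data.table auto-names `V1`, which is not the original `V1` column:

# two columns: the rleid group, and the chosen row indices per group
# (auto-named V1 because the j expression is unnamed)
setDT(df)[, .I[(Volume == max(Volume) & V1 == 1) | V1 == 0], by = rleid(V1)]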
Tonio Liebrand
  • Assuming that there are only 1s and 0s in 'V1', do you need `V1 == 1`, or `V1 == 0`? – akrun May 24 '17 at 10:30
  • You are right, I made an edit. I also noticed that I replace the V1 column with an rleid column. Unfortunately, I don't have the time to repair it elegantly right now... – Tonio Liebrand May 24 '17 at 10:33
  • I guess you can use the row index: `setDT(df)[df[, .I[Volume == max(Volume)], rleid(V1)]$V1]` – akrun May 24 '17 at 10:35
  • Nice work! Small remark: that would also remove the non-max rows for the consecutive V1 == 0 runs, which matches the output picture but not the description. Nonetheless, feel free to add it to your answer; then I can delete my poor attempt :). – Tonio Liebrand May 24 '17 at 10:52
  • Weird, I got `Null data.table (0 rows and 0 cols)` with `setDT(df)[df[, .I[(Volume == max(Volume) & V1 == 1) | V1 == 0], rleid(V1)]$V1][]`. What could be the problem? – Jimmy May 25 '17 at 06:12
  • For me that works. I read in the data with `fread()`; could that make a difference? – Tonio Liebrand May 25 '17 at 08:12
  • @BigDataScientist What's the `$V1` at the end of `setDT(df)[df[, .I[(Volume == max(Volume) & V1 == 1) | V1 == 0], rleid(V1)]$V1][]` for? The problem seems related to that. – Jimmy May 25 '17 at 08:41
  • Just run `df[, .I[(Volume == max(Volume) & V1 == 1) | V1 == 0], rleid(V1)]` by itself and you will see it returns two columns, created by the necessary grouping with `rleid()`. But you don't want to keep the `rleid` column, so you access only `V1`, which holds the indices of the rows you want. – Tonio Liebrand May 25 '17 at 08:50
  • I see. Thank you very much. – Jimmy May 25 '17 at 09:27

We can use `rle` from base R:

# label each run of identical V1 values with its own group id
grp <- inverse.rle(within.list(rle(df$V1), values <- seq_along(values)))
# keep, within each run, only the row(s) attaining the maximum Volume
df[with(df, ave(Volume, grp, FUN = max) == Volume),]
#   Volume Weight V1 V2
#1  0.5367 0.5367  0  1
#4  1.1457 1.1413  1  0
#6  0.5694 0.5633  0  1
#7  1.2368 1.2343  1  0
#8  0.9662 0.9593  0  1
#10 1.4102 1.3995  1  0
#11 1.1132 1.1069  0  1
#12 1.4535 1.3923  1  0
#14 1.1475 1.1447  0  1
#15 1.1790 1.1748  1  0
#18 1.1557 1.1552  0  1

NOTE: We used a data.frame and not data.table as input data
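
If the `inverse.rle`/`within.list` idiom looks opaque, an equivalent way to build the same run ids is a cumulative sum over the change points (a sketch; both yield 1, 2, 3, ... along the runs of `V1`):

# start a new group id wherever V1 differs from the previous row
grp <- cumsum(c(TRUE, diff(df$V1) != 0))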

akrun
  • Awesome job. If I want the minimum, do I just change to `FUN = min`? Also, sometimes the Volumes will be the same, so instead of 0101 I will get 0110 or 01110 (depending on how many rows have the same Volume), etc. How do I delete the extra `1`s so that the whole `V1` column becomes 01010101? Cheers – Jimmy May 25 '17 at 01:18
  • Is there a more convenient way than using `duplicated(df$Volume)` on the `df`? Thanks – Jimmy May 25 '17 at 02:30
  • @Jimmy Could you please update your question? You can change `max` to `min` for the case mentioned. – akrun May 25 '17 at 04:07
  • Also, your method deletes some rows where `V1 == 0`; you cannot delete rows where `V1 == 0`, only where `V1 == 1`. – Jimmy May 25 '17 at 05:23

Alternatively, you can try using tidyverse functions.

Considering `dftest` as the initial data frame, mark the consecutive groups:

# number the runs of consecutive 1s in V1 (and in V2); rows outside a run get NA
dftest$f1 <- ifelse(x <- dftest$V1 == 1, cumsum(c(head(x, 1), tail(x, -1) - head(x, -1) == 1)), NA)
dftest$f2 <- ifelse(x <- dftest$V2 == 1, cumsum(c(head(x, 1), tail(x, -1) - head(x, -1) == 1)), NA)

and then operate on those groups:

library(dplyr)

dftest %>% 
  group_by(f1) %>%
  # within each V1 run keep the max-Volume row; ungrouped rows (f1 == NA) pass
  filter(if_else(is.na(f1), Volume == Volume, Volume == max(Volume))) %>%
  ungroup() %>%
  group_by(f2) %>%
  # then do the same within each V2 run
  filter(if_else(is.na(f2), Volume == Volume, Volume == max(Volume)))

which gives:

Source: local data frame [11 x 6]
Groups: f2 [7]

   Volume Weight    V1    V2    f1    f2
    <dbl>  <dbl> <int> <int> <int> <int>
1  0.5367 0.5367     0     1    NA     1
2  1.1457 1.1413     1     0     1    NA
3  0.5694 0.5633     0     1    NA     2
4  1.2368 1.2343     1     0     2    NA
5  0.9662 0.9593     0     1    NA     3
6  1.4102 1.3995     1     0     3    NA
7  1.1132 1.1069     0     1    NA     4
8  1.4535 1.3923     1     0     4    NA
9  1.1475 1.1447     0     1    NA     5
10 1.1790 1.1748     1     0     5    NA
11 1.1557 1.1552     0     1    NA     6
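
If the helper columns are not wanted in the final result, they can be dropped at the end (a small sketch; `res` is a hypothetical name for the result of the pipeline above):

# remove the grouping and drop the helper columns f1 and f2
res <- res %>% ungroup() %>% select(-f1, -f2)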
Aramis7d