1

My original data is like this

df <- structure(list(V = structure(c(4L, 5L, 3L, 7L, 6L, 2L, 1L), .Label = c("132 B26,172 B27,107 B57,104 B59,137 B60,133 B61,103 B62,134 B63,177 B100,123 B133,184 B168,109 B197,103 B198,173 B202,157 B203,143 B266,62 B342,62 B354,92 B355,195 B368,164 B370,52 B468,74 B469,71 B484,98 B494,66 B502,63 B601,133 B622", 
"135A,510A,511A,60 B23,67 B24,70 B25,95 B26,122 B27,123 B27,109 B60", 
"25A,28 B55,31 B56,45 B57,43 B58,5 B59,47 B59,6 B60,69 B60,66 B61", 
"267 B361,786 B363,543 B392", "563 B202,983 B360", "8 B1,12 B35,10 B71,9 B154,51 B179", 
"91 B26,117 B27,117 B28,102 B29,47 B31,96 B63,78 B64,133 B65,117 B66,121 B66,112 B67,127 B100"
), class = "factor")), .Names = "V", class = "data.frame", row.names = c(NA, 
-7L))

Thanks to @Arkun I can get an output with this function

Newdf <- data.frame(v1 = sapply(str_extract_all(df$V, "(?<=[A-Z])\\d+"), toString), stringsAsFactors=FALSE)

from this output,

Then I want to calculate the consecutive numbers in each row

row 1 does not have

row 2 does not have

row 3 has 1 consecutive 55,56,57,58,59,59,60,60,61

row 4 has two consecutive 26,27, 28, 29 and 63,64,65,66,66,67

row 5 does not

row 6 has 1

row 7 has has 6 (26,27) (59,60,61,62,63) (197,198) (202,203) (354,355) (468,469) Then I want to add one column showing the differences between each consecutive to next one ,

#for example (26,27) and (59,60,61,62,63)  is 59-27= 32
#(59,60,61,62,63) and (197,198) is 197-63=134
#(197,198)  and (202,203) is 202-198= 4
#(202,203) and (354,355) is 354-203= 151
#(354,355) and (468,469) is 468-355= 113

So my output will be like this

            V2              V3
            0               0
            0               0
            1               0
            2               34
            0               0
            1               0
            6            32,134,4,151,113
nik
  • 2,500
  • 5
  • 21
  • 48
  • @arkun as an example in row 7 between the two consecutive set (26,27) (59,60,61,62,63) , I will calculate their distance like this: between 26 and 26 which one is bigger ? 27, since the second set is the second , I check the smallest value so the distance between two will be 59-27 – nik Mar 03 '16 at 12:59
  • Try `sapply(str_extract_all(df$V, "(?<=[A-Z])\\d+"), function(x) {x1 <- as.numeric(x[!duplicated(x)]); sum(rle(diff(x1)==1)$values)})#[1] 0 0 1 2 0 1 6` – akrun Mar 03 '16 at 13:01
  • @arkun we always check one set with the next one in that row. so if we have 10 consecutive set, it will be like this bigger value of the first set with sampler value of the second set, then bigger value of the second set with samaller value of the third set, then smaller value of the third set with bigger value of the fourth set , this will continue until there is not any consecutive set remains – nik Mar 03 '16 at 13:02
  • The first column you will get that from the code above. – akrun Mar 03 '16 at 13:02
  • If you check my code, I didn't use the `toString` here. – akrun Mar 03 '16 at 13:11
  • Sorry, I got into another discussion and didn't check this comment. – akrun Mar 03 '16 at 13:38
  • Regarding the difference between each consecutive, there is a danger in that. The lengths could be either same or they may differ. I don't know what you wanted to do in cases where the lengths differ. – akrun Mar 03 '16 at 13:39
  • @akrun the same rule when the length of both sets are different, just we always check one set with the next one in that row. so if we have 10 consecutive set, it will be like this bigger value of the first set with sampler value of the second set, then bigger value of the second set with samaller value of the third set, then smaller value of the third set with bigger value of the fourth set , this will continue until there is not any consecutive set remains – nik Mar 04 '16 at 08:15
  • @akrun do you have any solution for this ? I have been playing around with this but I couldn't solve it – nik Mar 04 '16 at 08:15
  • I will check on that. – akrun Mar 04 '16 at 09:17
  • @akrun that would be amazing – nik Mar 04 '16 at 09:26
  • Can you please check your example. Something seems to be wrong. Why is 197, 198 not in the row 7 and why you have 484, 494, which has a difference o 10. – akrun Mar 04 '16 at 09:39
  • @akrun yes that is a typo, I will modify the text , the 197 and 198 should be in there and 484,494 should not – nik Mar 04 '16 at 09:43
  • @akrun I modified it – nik Mar 04 '16 at 09:50
  • I posted a solution. Please check if that works. – akrun Mar 04 '16 at 09:57
  • @akrun you removed your post ? – nik Mar 04 '16 at 10:39
  • Yes, as you said that it is not working in the original dataset – akrun Mar 04 '16 at 10:40
  • @akrun what should i do to solve this ? should i wait until its bounty ? because seems like no one answers! they all answer simple question and no one does go to more complicated ones! I don't know what I should do to get more help – nik Mar 04 '16 at 10:44
  • I wish I could go and check your big dataset. But, I am very busy with a project. Bounty is one way to get more attention. – akrun Mar 04 '16 at 10:46
  • @akrun I think I found where the problem is , can you only help me one thing , I see you make a list like this lst1 <- lapply(str_extract_all(df$V, "(?<=[A-Z])\\d+"), as.numeric) . is it possible to order each row from small to large values ? I think this is the problem – nik Mar 04 '16 at 11:05
  • That is simple, i.e. `lst1 <- lapply(lst1, sort)` – akrun Mar 04 '16 at 11:24
  • @akrun actually I like everything in programming, I use Matlab a lot, Python as well, I am trying to use R too, so all types of programming functions etc are interesting :-) I think I solved the problem, I am manually checking the result , then I tell you to post the code and I will accept it because it was mainly from you :-) – nik Mar 04 '16 at 11:38
  • I undeleted my post. Thanks for getting back to me. – akrun Mar 04 '16 at 11:40

1 Answers1

1

We could try

library(stringr)
library(data.table)
lst1 <- lapply(str_extract_all(df$V, "(?<=[A-Z])\\d+"), 
         as.numeric)
lst1 <- lapply(lst1, sort)
V2 <- sapply(lst1, function(x) {
         x1 <- x[!duplicated(x)]
         sum(rle(diff(x1)==1)$values)})
i1 <- V2 >1
V3 <- rep(0, length(V2))

V3[i1] <- unlist(lapply(lst1[i1], function(v1) {
        gr <- cumsum(c(TRUE,v1[-1]-v1[-length(v1)]>1))
        d1 <- data.table(v1, gr)
        d1[, if(.N >1) .SD, gr
             ][, list(v1[1], v1[.N]) , gr
              ][, {tmp <- V1-shift(V2)
                 list(toString(tmp[!is.na(tmp)]))}]
        }), use.names=FALSE)

d1 <- data.frame(V2, V3, stringsAsFactors=FALSE)
d1
#  V2                   V3
#1  0                    0
#2  0                    0
#3  1                    0
#4  2                   34
#5  0                    0
#6  1                    0
#7  6 32, 134, 4, 151, 113
akrun
  • 874,273
  • 37
  • 540
  • 662