arrange dataframe based on one column eliminating the unwanted responses

Question

I have this data

       date                                           signal 
1   2009-01-13 09:55:00  4645.00  4838.931  5358.883  Buy2
2   2009-01-14 09:55:00  4767.50  4718.254  5336.703  Buy1
3   2009-01-15 09:55:00  4485.00  4653.316  5274.384  Buy2
4   2009-01-16 09:55:00  4580.00  4537.693  5141.435  Buy1
5   2009-01-19 09:55:00  4532.00  4548.088  4891.041  Buy2
6   2009-01-27 09:55:00  4190.00  4183.503  4548.497  Buy1
7   2009-01-30 09:55:00  4436.00  4155.236  4377.907 Sell1
8   2009-02-02 09:55:00  4217.00  4152.626  4390.802 Sell2
9   2009-02-09 09:55:00  4469.00  4203.437  4376.277 Sell1
10  2009-02-12 09:55:00  4469.90  4220.845  4503.798 Sell2
11  2009-02-13 09:55:00  4553.00  4261.980  4529.777 Sell1
12  2009-02-16 09:55:00  4347.20  4319.656  4564.387 Sell2
13  2009-02-17 09:55:00  4161.05  4371.474  4548.912  Buy2
14  2009-02-27 09:55:00  3875.55  3862.085  4101.929  Buy1
15  2009-03-02 09:55:00  3636.00  3846.423  4036.020  Buy2
16  2009-03-12 09:55:00  3420.00  3372.665  3734.949  Buy1
17  2009-03-13 09:55:00  3656.00  3372.100  3605.357 Sell1
18  2009-03-17 09:55:00  3650.00  3360.421  3663.322 Sell2
19  2009-03-18 09:55:00  3721.00  3363.735  3682.293 Sell1
20  2009-03-20 09:55:00  3687.00  3440.651  3784.778 Sell2

and have to arrange it in this form

2   2009-01-14 09:55:00  4767.50  4718.254  5336.703  Buy1
7   2009-01-30 09:55:00  4436.00  4155.236  4377.907 Sell1
8   2009-02-02 09:55:00  4217.00  4152.626  4390.802 Sell2
13  2009-02-17 09:55:00  4161.05  4371.474  4548.912  Buy2
14  2009-02-27 09:55:00  3875.55  3862.085  4101.929  Buy1
17  2009-03-13 09:55:00  3656.00  3372.100  3605.357 Sell1
18  2009-03-17 09:55:00  3650.00  3360.421  3663.322 Sell2

So that data is arranged in order of Buy1 Sell1 Sell2 Buy2 and eliminating the middle observations. I have tried several dplyr:filter commands but none is giving the desired output.

Hobo Sheep · Accepted Answer · 2018-07-20T07:45:58.423

If I have well understood your problem, the following code should solve it. It is adapted from this discussion.

The idea is to define your sequence as a pattern:

pattern <- c("Buy1", "Sell1", "Sell2", "Buy2")

Then find the position of this pattern in your column:

library(zoo)
 pos <- which(rollapply(data$signal, 4, identical, pattern, fill = FALSE, align = "left"))

and extract the rows following the position of your patterns:

rows <- unlist(lapply(pos, function(x, n) seq(x, x+n-1), 4))
data_filtered <- data[rows,]

Voilà.

EDIT

Since I had misunderstood your problem, here is a new solution. You want to retrieve the sequence "Buy1", "Sell1", "Sell2", "Buy2" in your column, and eliminate the observations that do not fit in this sequence. I do not see a trivial vectorised solution, so here is a loop to solve that. Depending on the size of your data, you may want to implement a similar algorithm in RCPP or vectorise it in some ways.

sequence <- c("Buy1", "Sell1", "Sell2", "Buy2")
keep <- logical(length(data$signal))

s <- 0
for (i in seq(1, length(data$signal))){
    if (sequence[s +1] == data$signal[i]){
        keep[i] <- T
        s <- (s + 1) %% 4
    } else {
        keep[i] <- F
    }
}

data_filtered <- data[keep,]

Tell me if this work better. If anyone has a vectorised solution, I would be curious to see it.

this is not what i need. read my data carefully i want to arrange my data in this order c("Buy1", "Sell1", "Sell2", "Buy2") surely but starting from the first Buy1 appeared in data then eliminate the middle observations upto subsequent Sell1 then eliminate the middle observations upto subsequent Sell2 and so on — david jones, Jul 20 '18 at 07:01
I've edited my answer to better answer your problem. I hope your data is not too big and that solution works for you. — Hobo Sheep, Jul 20 '18 at 07:47

score 0 · Answer 2 · answered Jul 19 '18 at 11:57

0

You can coerce the column data$signal into a factor and define the levels.

data$signal <- as.factor(data.$signal, levels = c("Buy1","Sell1","Buy2","Sell2")

Then you can sort it

sorted.data <- data[order(signal),]

Here is a great answer that talks about what you want to do:

Sort data frame column by factor

answered Jul 19 '18 at 11:57

bstrain

278
1
9

if think OP wants to "eliminating the middle observations". – Andre Elrico Jul 19 '18 at 12:01

score 0 · Answer 3 · answered Jul 19 '18 at 12:05

Here is a Rcpp solution:

library(Rcpp)

cppFunction('LogicalVector FindHit(const CharacterVector x, const CharacterVector y) {
    LogicalVector res(x.size());
    int k = 0;
    for(int i = 0; i < x.size(); i++){
        if(x[i] == y[k]){
            res[i] = true;
            k = (k + 1) % y.size();
        }
    }
    return res;
}')

dtt[FindHit(dtt$V6, c('Buy1', 'Sell1', 'Sell2', 'Buy2')),]

#            V1       V2      V3       V4       V5    V6
# 2  2009-01-14 09:55:00 4767.50 4718.254 5336.703  Buy1
# 7  2009-01-30 09:55:00 4436.00 4155.236 4377.907 Sell1
# 8  2009-02-02 09:55:00 4217.00 4152.626 4390.802 Sell2
# 13 2009-02-17 09:55:00 4161.05 4371.474 4548.912  Buy2
# 14 2009-02-27 09:55:00 3875.55 3862.085 4101.929  Buy1
# 17 2009-03-13 09:55:00 3656.00 3372.100 3605.357 Sell1
# 18 2009-03-17 09:55:00 3650.00 3360.421 3663.322 Sell2

Here is the dtt:

> dput(dtt)
structure(list(V1 = c("2009-01-13", "2009-01-14", "2009-01-15", 
"2009-01-16", "2009-01-19", "2009-01-27", "2009-01-30", "2009-02-02", 
"2009-02-09", "2009-02-12", "2009-02-13", "2009-02-16", "2009-02-17", 
"2009-02-27", "2009-03-02", "2009-03-12", "2009-03-13", "2009-03-17", 
"2009-03-18", "2009-03-20"), V2 = c("09:55:00", "09:55:00", "09:55:00", 
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", 
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", 
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00"), 
    V3 = c(4645, 4767.5, 4485, 4580, 4532, 4190, 4436, 4217, 
    4469, 4469.9, 4553, 4347.2, 4161.05, 3875.55, 3636, 3420, 
    3656, 3650, 3721, 3687), V4 = c(4838.931, 4718.254, 4653.316, 
    4537.693, 4548.088, 4183.503, 4155.236, 4152.626, 4203.437, 
    4220.845, 4261.98, 4319.656, 4371.474, 3862.085, 3846.423, 
    3372.665, 3372.1, 3360.421, 3363.735, 3440.651), V5 = c(5358.883, 
    5336.703, 5274.384, 5141.435, 4891.041, 4548.497, 4377.907, 
    4390.802, 4376.277, 4503.798, 4529.777, 4564.387, 4548.912, 
    4101.929, 4036.02, 3734.949, 3605.357, 3663.322, 3682.293, 
    3784.778), V6 = c("Buy2", "Buy1", "Buy2", "Buy1", "Buy2", 
    "Buy1", "Sell1", "Sell2", "Sell1", "Sell2", "Sell1", "Sell2", 
    "Buy2", "Buy1", "Buy2", "Buy1", "Sell1", "Sell2", "Sell1", 
    "Sell2")), row.names = c(NA, -20L), class = "data.frame")

Error in source(scriptPath, local = env) :`.sourceCpp_1_DLLInfo` <- dyn.load('C:/Users/Joshila's it shows error — david jones, Jul 20 '18 at 08:02
@davidjones Sorry. This code is tested on Linux (ubuntu 16.04). I didn't have a Windows PC and cannot give any suggestion about the error message on Windows. — mt1022, Jul 20 '18 at 12:09

arrange dataframe based on one column eliminating the unwanted responses

3 Answers3

Linked