I have the following data frame containing values for angular change in degrees, one observation per row:

'data.frame':   712801 obs. of  4 variables:
 $ time_passed: int  1 2 3 4 5 6 7 8 9 10 ...
 $ dRoll      : num  0.9798 -0.5099 -0.0974 -0.4985 0.1719 ...
 $ dPitch     : num  -0.175 -0.0655 0.0653 0.8907 -1.0893 ...
 $ dYaw       : num  0.33232 0.06875 -0.00573 0.59588 -0.55577 ...

> myData[1:20,]
time_passed       dRoll       dPitch      dYaw
       1          0.97975783 -0.17498131  0.332315521
       2         -0.50993244 -0.06548908  0.068754935
       3         -0.09740283  0.06531719 -0.005729578
       4         -0.49847328  0.89072019  0.595876107
       5          0.17188734 -1.08930736 -0.555769061
       6          0.68181978  0.36852645  0.492743704
       7          1.07143108  0.15206300 -0.635983153
       8         -1.43812407 -0.76638835 -0.509932438
       9          0.43544792  0.41241502  0.767763445
      10          0.25210143  0.61375239  0.509932438
      11          0.38961130  0.01203211 -0.360963411
      12          0.03437747 -0.29633377 -0.315126787
      13         -0.33804510 -0.40639896 -0.177616916
      14          0.68181978  0.32446600  0.435447924
      15         -1.12872686 -0.37752189 -0.275019742
      16          0.75057471  0.33907642  0.464095814
      17         -0.25783101  0.11310187  0.309397209
      18         -0.01718873 -0.13435860 -0.521391594
      19          0.12605071  0.12817066 -0.085943669
      20          0.02291831 -0.59856901 -0.120321137

How would I write something like

"If the sum of a run of consecutive negative (or positive) values is smaller than my threshold (say, a 5° change), then throw it out of the data set"

in R code?

I would like to be able to apply this criterion to any of the columns, so dRoll or dPitch or dYaw.


In this case, applied based on the dRoll column, the output would be:

time_passed       dRoll       dPitch      dYaw
       1          0.97975783 -0.17498131  0.332315521
       5          0.17188734 -1.08930736 -0.555769061
       6          0.68181978  0.36852645  0.492743704
       7          1.07143108  0.15206300 -0.635983153
       9          0.43544792  0.41241502  0.767763445
      10          0.25210143  0.61375239  0.509932438
      11          0.38961130  0.01203211 -0.360963411
      12          0.03437747 -0.29633377 -0.315126787
      14          0.68181978  0.32446600  0.435447924
      16          0.75057471  0.33907642  0.464095814
      19          0.12605071  0.12817066 -0.085943669
      20          0.02291831 -0.59856901 -0.120321137

All negative runs in dRoll were thrown out, because the absolute sum of each run of consecutive negative values was smaller than 5 degrees:

  • First negative run in dRoll: sum(myData[2:4,2]) = -1.105809
  • The second, third and fourth runs are only one number each: -1.43812, -0.33804, -1.12872
  • Last run in dRoll: sum(myData[17:18,2]) = -0.2750197
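
The run sums listed above can be reproduced with base R's rle(), which finds runs of consecutive equal values. This is only a small sketch to make the definition of a "run" concrete: the vector dRoll below is typed in from the 20 rows shown above, and the names runs, run_id, run_sum and neg_run_sums are made up for this illustration.

```r
# dRoll values of the 20 sample rows shown above
dRoll <- c(0.97975783, -0.50993244, -0.09740283, -0.49847328, 0.17188734,
           0.68181978, 1.07143108, -1.43812407, 0.43544792, 0.25210143,
           0.38961130, 0.03437747, -0.33804510, 0.68181978, -1.12872686,
           0.75057471, -0.25783101, -0.01718873, 0.12605071, 0.02291831)

# rle() finds runs of consecutive values with the same sign
runs <- rle(dRoll > 0)

# give every row the id of the run it belongs to, then sum per run
run_id  <- rep(seq_along(runs$lengths), runs$lengths)
run_sum <- tapply(dRoll, run_id, sum)

# keep only the sums of the negative runs
neg_run_sums <- unname(run_sum[!runs$values])
print(neg_run_sums)
```

None of the five negative run sums reaches 5 degrees in absolute value, which is why every negative run is dropped in the desired output.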

How would one do that in R?

Jaap
Joris
  • Could you post your desired output? – m-dz May 16 '16 at 10:32
  • You have just filtered out rows with negative values in dRoll. Maybe you can elaborate on this, e.g. with step by step calculations? – m-dz May 16 '16 at 11:10
  • @M.D, I tried doing that, I hope it's more clear now what I am trying to do. The point is that if one of the negative runs would have added up to more than my threshold value, it would have to stay in the dataframe. – Joris May 16 '16 at 11:21

1 Answer

My advice would be to melt your data frame into long format first. After that, grouped operations become much easier.

Using the data.table package (which we need for the melt and rleid functions):

# load the package
library(data.table)

# convert the data.frame from the question to a data.table, then melt into long format
DT <- as.data.table(myData)
DT2 <- melt(DT, id.vars = 'time_passed')

# create a cumulative sum for each run
# 'rleid(value > 0)' creates a grouping variable for runs of consecutive positive/negative values
# by adding '[.N]' to 'cumsum(value)' you set all values in 'csum' to the last cumulative value,
# i.e. the total sum of each run, which we can use to filter the data
DT2[, csum := cumsum(value)[.N], by = .(variable, rleid(value > 0))]

# filter the data according to a rule
# in this case rows belonging to a run whose sum lies between -1.2 and -0.2 are filtered out
DT2[csum < -1.2 | csum > -0.2]

which gives (a snapshot of the result):

    time_passed variable        value         csum
 1:           1    dRoll  0.979757830  0.979757830
 2:           5    dRoll  0.171887340  1.925138200
 3:           6    dRoll  0.681819780  1.925138200
 4:           7    dRoll  1.071431080  1.925138200
 5:           8    dRoll -1.438124070 -1.438124070
 6:           9    dRoll  0.435447920  1.111538120
....
....
14:           3   dPitch  0.065317190  0.956037380
15:           4   dPitch  0.890720190  0.956037380
16:           6   dPitch  0.368526450  0.520589450
17:           7   dPitch  0.152063000  0.520589450
18:           9   dPitch  0.412415020  1.038199520
19:          10   dPitch  0.613752390  1.038199520
....
....
26:           1     dYaw  0.332315521  0.401070456
27:           2     dYaw  0.068754935  0.401070456
28:           3     dYaw -0.005729578 -0.005729578
29:           4     dYaw  0.595876107  0.595876107
30:           6     dYaw  0.492743704  0.492743704
31:           9     dYaw  0.767763445  1.277695883
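
If you only want to filter the original wide table on a single column, the same idea works without melting. The following is a minimal sketch rather than a definitive solution. It assumes that, as in the desired output of the question, only negative runs are candidates for removal and that a run is dropped when its absolute sum stays below the threshold. The names threshold, run_sum and kept are made up for this example, and only the dRoll column of the 20 sample rows is reproduced.

```r
library(data.table)

# time_passed and dRoll of the 20 sample rows from the question
DT <- data.table(
  time_passed = 1:20,
  dRoll = c(0.97975783, -0.50993244, -0.09740283, -0.49847328, 0.17188734,
            0.68181978, 1.07143108, -1.43812407, 0.43544792, 0.25210143,
            0.38961130, 0.03437747, -0.33804510, 0.68181978, -1.12872686,
            0.75057471, -0.25783101, -0.01718873, 0.12605071, 0.02291831)
)

threshold <- 5  # degrees; the example value from the question

# total change within each run of same-signed dRoll values
DT[, run_sum := sum(dRoll), by = rleid(dRoll > 0)]

# assumption: drop negative runs whose total change stays below the threshold
kept <- DT[!(run_sum < 0 & abs(run_sum) < threshold)]
print(kept$time_passed)
```

On the sample rows this keeps time_passed 1, 5, 6, 7, 9, 10, 11, 12, 14, 16, 19 and 20, which are the rows in the desired output of the question. If positive runs should be tested against the threshold as well, drop the 'run_sum < 0' part of the condition.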
Jaap