0

Below, you can find an excerpt from a time series. I would like to slice the rows set apart by the white space (the white space is for clarity's sake, there is no empty row in the data). There is a recurring pattern, i.e. 21 times value X, four different values ABCD, 21 times value Y, four different values EFGH, 21 times value Z etc. In this case, I'm interested in obtaining BCDY, FGHZ and so on.

The problem is that this pattern is sometimes interrupted due to missing data, making it irregular. As a result, the number of values I want to discard (the values X, Y, Z) is sometimes lower than 21. E.g. the values between 2014-01-20 00:05:00 and 2014-01-20 00:11:00 could as well be missing.

I can think of looping over the series, but I prefer a vectorized approach. I would like to implement it in R, but Python or Matlab will do as well.

Any ideas? Thanks.

2014-01-20 00:00:00    197021
2014-01-20 00:01:00    197021
2014-01-20 00:02:00    197021
2014-01-20 00:03:00    197021
2014-01-20 00:04:00    197021
2014-01-20 00:05:00    197021
2014-01-20 00:06:00    197021
2014-01-20 00:07:00    197021
2014-01-20 00:08:00    197021
2014-01-20 00:09:00    197021
2014-01-20 00:10:00    197021
2014-01-20 00:11:00    197021
2014-01-20 00:12:00    197021
2014-01-20 00:13:00    197021
2014-01-20 00:14:00    197021
2014-01-20 00:15:00    197021
2014-01-20 00:16:00    196836

2014-01-20 00:17:00    196865
2014-01-20 00:18:00    196787
2014-01-20 00:19:00    196915
2014-01-20 00:20:00    196902

2014-01-20 00:21:00    196902
2014-01-20 00:22:00    196902
2014-01-20 00:23:00    196902
2014-01-20 00:24:00    196902
2014-01-20 00:25:00    196902
2014-01-20 00:26:00    196902
2014-01-20 00:27:00    196902
2014-01-20 00:28:00    196902
2014-01-20 00:29:00    196902
user2143353
  • 87
  • 1
  • 6
  • What does the missing data look like? Because otherwise it sounds like you just want groups of 25 rows starting from row 27 (i.e. `BCDY = data(27:27+25, :)` in Matlab) – Dan Mar 12 '14 at 08:50
  • 1
    Can you share your data with us in a handy manner? Here are some tips on how to do it: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example I bet `rle` will be part of the end solution, though. – Roman Luštrik Mar 12 '14 at 09:03
  • Why is there a white space between ``00:16:00 196836`` and ``00:17:00 196865`` ? I understood this white space doesn't exist in fact, the real question is: what differentiates a times series from the following one ?? – eyquem Mar 12 '14 at 09:20

2 Answers2

1

If I understand you aright, you want to remove all data rows where the last column is unchanged from the previous row. In Matlab, you can do this using the diff() function and logical indexing. Assuming your data is in a two-column matrix, then the expression

data([true; diff(data(:,2))~=0],:)

will return a two-column matrix with just the data meeting the requirement. You might need to check the first row separately: I'm not quite clear from your description whether you always want the first row or not. The above will always keep it. Change the true to false to always drop it.

Edit (in response to the first comment)

Replacing true with false in the above expression discards the first row. This leaves you with blocks of 5 rows, of which you want to discard the first row in each block. This can also be done with logical indexing. It's fairly simple, except that you need to protect against the case when the last block contains less than 5 rows:

pData = data([false; diff(data(:,2))~=0],:);
selector = repmat([false; true; true; true; true], ceil(size(pData, 1))/5, 1);
pData = pData(selector(1:size(pData,1)),:);

I hope this helps!

Max
  • 2,121
  • 3
  • 16
  • 20
  • This is getting close, but it results in the 5 values that differ from the preceding series of equal values. I also want to disregard the first changed value. In the end, I want to keep the four values after the first changed value. – user2143353 Mar 13 '14 at 21:27
1

I am embarassed to explain the logic of the following code because it seems evident to me.
If any precision is needed, do ask me.

I just give this explanation:
you wrote "the number of values I want to discard (the values X, Y, Z) is sometimes lower than 21."
In my code, the hypothesis I did is that this lower number can't be inferior to 3.

The use of StringIO isn't a requirement, it is just to give me the easiness to present a code runalble as is, treating data included in the code.
The only requirement is to pass an iterable to the function slik() : it can be a generator as in the following code, or a file handler, or a list, or anything else that can be iterated.

I split the code in order to make the executing instructions appear without having to scroll down, but the two parts must be in the same coe to run it.

text = '''
2014-01-20 00:00:00    197021
2014-01-20 00:01:00    197021
2014-01-20 00:02:00    197021
2014-01-20 00:03:00    197021
2014-01-20 00:04:00    197021
2014-01-20 00:05:00    197021
2014-01-20 00:06:00    197021
2014-01-20 00:07:00    197021
2014-01-20 00:08:00    197021
2014-01-20 00:09:00    197021
2014-01-20 00:10:00    197021
2014-01-20 00:11:00    197021
2014-01-20 00:12:00    197021
2014-01-20 00:13:00    197021
2014-01-20 00:14:00    197021
2014-01-20 00:15:00    197021
2014-01-20 00:16:00    196836
2014-01-20 00:17:00    196865
2014-01-20 00:18:00    196787
2014-01-20 00:19:00    196915
2014-01-20 00:20:00    196902
2014-01-20 00:21:00    196902
2014-01-20 00:22:00    196902
2014-01-20 00:23:00    196902
2014-01-20 00:24:00    196902
2014-01-20 00:25:00    196902
2014-01-20 00:26:00    196902
2014-01-20 00:27:00    196902
2014-01-20 00:28:00    196902
2014-01-20 00:29:00    196902
'''

.

from StringIO import StringIO
rid = (x.strip() for x in StringIO(text))

def slik(INPUT):
    from collections import deque
    rows = deque( ((x for x in INPUT if x).next(),)  , maxlen=10 )
    rows.extend ( INPUT.next() for i in xrange(8) )
    val  = deque( (x.rsplit(None,1)[1] for x in rows), maxlen=10)
    for x in INPUT:
        rows.append(x)
        val.append(x.rsplit(None,1)[1])
        if val[0]==val[1]==val[2] != val[7]==val[8]==val[9]:
            for i in (3,4,5,6,7):  yield rows[i]

print '\n'.join(slik(rid))
eyquem
  • 26,771
  • 7
  • 38
  • 46