I'm new to Python and have a simple question that I haven't found an answer to yet. Let's say I have a time series c(t):

t_  c_
1   40
2   41
3   4
4   5
5   7
6   20
7   20
8   8
9   90
10  99
11  10
12  5
13  8
14  8
15  19

I now want to evaluate this series with respect to how long the value c has been continuously in certain ranges and how often these time periods occur.

The result would therefore include three columns: c (binned), duration (binned), frequency. Translated to the simple example the result could look as follows:

c_      Dt_  Freq_ 
0-50    8    1 
50-100  2    1
0-50    5    1

Can you give me some advice?

Thanks in advance,

Ulrike

//EDIT: Thank you for the replies! My example data were somewhat flawed so that I couldn't show a part of my question. So, here is a new data series:

series=
t   c
1   1
2   1
3   10
4   10
5   10
6   1
7   1
8   50
9   50
10  50
12  1
13  1
14  1

If I apply the code proposed by Christoph below:

bins = pd.cut(series['c'], [-1, 5, 100])
same_as_prev = (bins != bins.shift())
run_ids = same_as_prev.cumsum()
result = bins.groupby(run_ids).aggregate(["first", "count"])

I receive a result like this:

first   count
(-1, 5]   2
(5, 100]  3
(-1, 5]   2
(5, 100]  3
(-1, 5]   3

but what I'm more interested in is something like this:

c        length  freq
(-1, 5]    2      2
(-1, 5]    3      1
(5, 100]   3      2

How do I achieve this? And how could I plot it in a KDE plot?

Best,

Ulrike


2 Answers

Nicely asked question with an example :) This is one way to do it, most likely incomplete, but it should help you a bit.

Since your data is spaced in time by a fixed increment, I don't build an actual time series and simply use the array index as time. I convert c to an array and use np.where() to find the indices of the values that fall into each bin.

import numpy as np

c = np.array([40, 41, 4, 5, 7, 20, 20, 8, 90, 99, 10, 5, 8, 8, 19])

bin1 = np.where((0 <= c) & (c <= 50))[0]
bin2 = np.where((50 < c) & (c <= 100))[0]

For bin1, the output is array([ 0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14], dtype=int64), which contains the indices of the values in c that fall into the bin.

The next step is to find the runs of consecutive indices. According to this SO post:

from itertools import groupby
from operator import itemgetter

data = bin1
for k, g in groupby(enumerate(data), lambda ix : ix[0] - ix[1]):
    print(list(map(itemgetter(1), g)))

# Output is:
#[0, 1, 2, 3, 4, 5, 6, 7]
#[10, 11, 12, 13, 14]
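
Why the grouping key works: inside a run of consecutive indices, "position minus value" stays constant, and it changes as soon as there is a gap. A minimal sketch on a made-up index list (the list and the names keys and runs are only for illustration):

```python
from itertools import groupby

# Hypothetical sorted index list with two gaps (after 2 and after 6).
data = [0, 1, 2, 5, 6, 9]

# Position minus value is constant within each run of consecutive numbers.
keys = [pos - val for pos, val in enumerate(data)]

# groupby starts a new group every time that key changes.
runs = [
    [val for _, val in group]
    for _, group in groupby(enumerate(data), key=lambda pv: pv[0] - pv[1])
]

print(keys)  # [0, 0, 0, -2, -2, -4]
print(runs)  # [[0, 1, 2], [5, 6], [9]]
```

Note that this only works if the indices are sorted in increasing order, which is the case for the output of np.where().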

Final step: put the new sub-bins in the right order and track which bin each sub-bin belongs to. The complete code would look like this:

import numpy as np
from itertools import groupby
from operator import itemgetter

c = np.array([40, 41, 4, 5, 7, 20, 20, 8, 90, 99, 10, 5, 8, 8, 19])

bin1 = np.where((0 <= c) & (c <= 50))[0]
bin2 = np.where((50 < c) & (c <= 100))[0]

# 1 and 2 for the range names.
bins = [(bin1, 1), (bin2, 2)]
subbins = list()

for b in bins:
    data = b[0]
    name = b[1] # 1 or 2
    for k, g in groupby(enumerate(data), lambda ix : ix[0] - ix[1]):
        subbins.append((list(map(itemgetter(1), g)), name))

subbins = sorted(subbins, key=lambda x: x[0][0])

Output: [([0, 1, 2, 3, 4, 5, 6, 7], 1), ([8, 9], 2), ([10, 11, 12, 13, 14], 1)]

Then, you just have to do the stats you want :)
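
As a rough sketch of what those stats could look like: with a fixed time step, the duration of a run is just the number of indices in it, and collections.Counter can count how often each (range, duration) pair occurs. The subbins list is copied from the output above; range_labels is a hypothetical mapping I made up for display.

```python
from collections import Counter

# Output of the code above: (consecutive indices, range name).
subbins = [([0, 1, 2, 3, 4, 5, 6, 7], 1), ([8, 9], 2), ([10, 11, 12, 13, 14], 1)]

# Hypothetical display names for the two ranges.
range_labels = {1: "0-50", 2: "50-100"}

# Duration of each run = number of indices it covers (fixed time step assumed).
durations = [(range_labels[name], len(indices)) for indices, name in subbins]

# Frequency of each (range, duration) combination.
freq = Counter(durations)

for (label, duration), count in freq.items():
    print(label, duration, count)
```

For the example data every combination occurs exactly once, which matches the table in the question.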

Mathieu
import pandas as pd

def bin_run_lengths(series, bins):

    binned = pd.cut(pd.Series(series), bins)
    return binned.groupby(
        (1 - (binned == binned.shift())).cumsum()
    ).aggregate(
        ["first", "count"]
    )

(I'm not sure where your frequency column comes in - in the problem as you describe it, it seems like it would always be set to 1.)
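
For reference, here is a sketch of the same idea applied to the first example series from the question. It is a variation of the function above, not the identical code: the labels "0-50" and "50-100", include_lowest=True, and the comparison on the category codes are my own choices for this illustration.

```python
import pandas as pd

c = pd.Series([40, 41, 4, 5, 7, 20, 20, 8, 90, 99, 10, 5, 8, 8, 19])

# Assign each value to a labelled range; include_lowest=True keeps c == 0 in "0-50".
binned = pd.cut(c, bins=[0, 50, 100], labels=["0-50", "50-100"], include_lowest=True)

# Compare the integer category codes so the run detection does not depend on
# how categorical values compare against the missing value created by shift().
codes = binned.cat.codes
run_ids = (codes != codes.shift()).cumsum()

# One row per run: the range it belongs to and how many samples it lasted.
runs = binned.astype(str).groupby(run_ids).agg(["first", "count"])
runs.columns = ["c", "duration"]

print(runs)
```

This gives three runs with durations 8, 2 and 5, i.e. the Dt_ column from the question.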

Binning

Binning a series is easy with pandas.cut():

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html

import pandas as pd

pd.cut(pd.Series(range(100)), bins=[-1,0,10,20,50,100])

The bins here are given as (right-inclusive, left-exclusive) boundaries; the argument can be given in different forms.

0       (-1.0, 0.0]
1       (0.0, 10.0]
2       (0.0, 10.0]
3       (0.0, 10.0]
4       (0.0, 10.0]
5       (0.0, 10.0]
6       (0.0, 10.0]
          ...
19     (10.0, 20.0]
20     (10.0, 20.0]
21     (20.0, 50.0]
22     (20.0, 50.0]
23     (20.0, 50.0]
          ...
29     (20.0, 50.0]
          ...      
99    (50.0, 100.0]
Length: 100, dtype: category
Categories (5, interval[int64]): [(-1, 0] < (0, 10] < (10, 20] < (20, 50] < (50, 100]]

This converts it from a series of values to a series of intervals.
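
To see what the boundary handling means in practice, here is a small sketch that bins three edge values twice, once with the default right-closed intervals and once with right=False. The values and bin edges are made up for the illustration; a category code of -1 means the value fell outside all bins (NaN).

```python
import pandas as pd

values = pd.Series([0, 10, 20])
edges = [0, 10, 20]

# Default: intervals are (0, 10] and (10, 20], so 0 falls outside.
right_closed = pd.cut(values, bins=edges)

# right=False: intervals are [0, 10) and [10, 20), so 20 falls outside.
left_closed = pd.cut(values, bins=edges, right=False)

print(right_closed.cat.codes.tolist())  # [-1, 0, 1]
print(left_closed.cat.codes.tolist())   # [0, 1, -1]
```

This is why the example above starts its bins at -1: with right-closed intervals, a lower edge of 0 would leave the value 0 unbinned.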

Count consecutive values

This doesn't have a native idiom in pandas, but it is fairly easy with a few common functions. The top-voted StackOverflow answer to "Counting consecutive positive value in Python array" puts it very well:

same_as_prev = (series != series.shift())

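This yields a Boolean series that is True wherever the value differs from the one before it. Despite the variable name, True therefore marks the start of a new run, not a repeat of the previous value.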

run_ids = same_as_prev.cumsum()

This makes an int series that starts at 1 (the first element always differs from the missing value that shift() puts in front of it) and increases by 1 each time the value changes, and thus assigns each position in the series to a "run ID".

result = series.groupby(run_ids).aggregate(["first", "count"])

This yields a dataframe that shows the value in each run and the length of that run:

        first  count
1     (-1, 0]      1
2     (0, 10]     10
3    (10, 20]     10
4    (20, 50]     30
5   (50, 100]     49
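
Regarding the edit in the question (how often each run length occurs per bin): one possible way, sketched here under a few assumptions, is to group the run table a second time by (bin, length) and take the group sizes. The column names c, length and freq follow the question; converting the intervals to strings and comparing category codes are my own choices. The rows are treated as consecutive samples, so the missing t = 11 is ignored.

```python
import pandas as pd

# The second example series from the question.
series = pd.DataFrame({
    "t": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14],
    "c": [1, 1, 10, 10, 10, 1, 1, 50, 50, 50, 1, 1, 1],
})

binned = pd.cut(series["c"], [-1, 5, 100])

# A new run starts wherever the bin (compared via its integer code) changes.
codes = binned.cat.codes
run_ids = (codes != codes.shift()).cumsum()

# One row per run: its bin (as a string label) and its length.
runs = binned.astype(str).groupby(run_ids).agg(["first", "count"])
runs.columns = ["c", "length"]

# Count how many runs share the same bin and the same length.
freq = runs.groupby(["c", "length"]).size().reset_index(name="freq")

print(freq)
```

This should reproduce the three rows asked for in the question. For the KDE plot, a possible starting point (not tested here, and it needs matplotlib and SciPy installed) would be runs["length"].plot.kde(), although a density estimate only becomes meaningful with many more runs than in this small example.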
Christoph Burschka