2

Consider the following dataframe df:

import pandas as pd
d = {"A":[3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2], "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)

I am interested in identifying periods where the value equals 2 for each column. Specifically, I would like to print a message that indicates when (index) a value 2 has appeared and for how long (again in terms of indices) the value has remained 2 ignoring single occurrences. So for the above dataframe, the answer should look like this:

Column A: Value 2 was observed at instance 6 and continued till instance 8.
Column B: Value 2 was observed at instance 9 and continued till instance 10.

I can do this with whiles and for loops but is there any pythonic method of doing that? Any help is appreciated.

Tunaki
  • 132,869
  • 46
  • 340
  • 423
Sepehr
  • 442
  • 1
  • 6
  • 17
  • You can use an approach like this http://stackoverflow.com/a/7100681/2548351 looking for [2,2]. Then reverse [::-1] it and look for it again to find ending index provided it occurs only once in each column. – caiohamamura Dec 14 '15 at 12:24
  • This might help get you started: `np.logical_and(df == 2, df.diff() == 0)` - from there it's just a matter of finding the spans of True values. – John Zwinck Dec 14 '15 at 12:28
  • 1
    You could do something like `inds = df["A"][(df["A"] == 2) & (df["A"].shift() == 2)].index` – Padraic Cunningham Dec 14 '15 at 12:30

2 Answers2

2

Using numpy, one possible solution would be the following (based largely off this answer).

import pandas as pd
d = {"A":[3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2], "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)

import numpy as np

def runs_of_ones_array(bits):
    # make sure all runs of ones are well-bounded
    bounded = np.hstack(([0], bits, [0]))
    # get 1 at run starts and -1 at run ends
    difs = np.diff(bounded)
    run_starts, = np.where(difs > 0)
    run_ends, = np.where(difs < 0)
    return np.vstack((run_starts, run_ends)).T

interesting_value = 2
runs = runs_of_ones_array(df["A"] == interesting_value)
for start, end in runs:
    end -= 1
    # since we don't seem to be interested in single-element runs
    if start == end:
        continue
    print("Value {} was observed at instance {} and continued till instance {}.".format(
        interesting_value, start, end))

The output of the above is

Value 2 was observed at instance 6 and continued till instance 8.

EDIT: modified the code to only output runs with length greater than 1.

EDIT2: regarding the speed of the two posted quite similar methods, I ran some benchmarks in IPython

EDIT3: If you include the boolean mask generation time in the benchmark, the groupby method outperforms the others by nearly an order of magnitude

In [28]:
%%timeit -n 10000
mask = df == 2
for col_name in mask:
    column = mask[col_name]
    runs = runs_of_ones_array(column)
    for start, end in runs:
        end -= 1
        if start == end:
            continue
        pass
10000 loops, best of 3: 452 µs per loop

In [29]:
%%timeit -n 10000
mask = df == 2
for col_name in mask:
    column = mask[col_name]
    ind = column[column].index.values
    for sub in np.split(ind, np.where(np.diff(ind) != 1)[0]+1):
        if sub.size > 1:
            pass
        pass
10000 loops, best of 3: 585 µs per loop

In [30]:
from itertools import groupby

In [31]:
%%timeit -n 10000
for k in df:
    ind = prev = 0
    for k, v in groupby(df[k], key=lambda x: x == 2):
        ind += sum(1 for _ in v)
        if k and prev + 1 != ind:
            pass
        prev = ind
10000 loops, best of 3: 73.4 µs per loop
Community
  • 1
  • 1
Martin Valgur
  • 5,793
  • 1
  • 33
  • 45
  • 1
    May I ask why the downvote? Is it the fact that my answer is almost entirely a based off another answer on SO? If I had enough reputation posts, I would have just posted a comment. If it is anything else, then I would simply appreciate the feedback. – Martin Valgur Dec 14 '15 at 12:43
  • @martin-valur Thanks for your solution. I didn't down-vote it but I will go with what padriac-cunningham suggested as it is more efficient as I believe. – Sepehr Dec 14 '15 at 13:02
  • 1
    try timing the groupby approach, I bet it will be significantly faster – Padraic Cunningham Dec 14 '15 at 15:37
  • 1
    Thanks for adding the groupby approach. I added that to the benchmark and you were absolutely correct, it is significantly faster than the others, indeed. – Martin Valgur Dec 15 '15 at 13:59
2

You could split:

import pandas as pd
d = {"A":[3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2], "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)

mask = (df == 2) & (df.shift() == 2)

inds_a = mask["A"][mask["A"]].index.values
inds_b = mask["B"][mask["B"]].index.values

for ind in [inds_a, inds_b]:
    for sub in np.split(ind,   np.where(np.diff(ind) != 1)[0]+1):
        print("2 appeared at {} to {}".format(sub[0]-1, sub[-1]))

It is probably faster just to get the indexes and filter in the split:

mask = df == 2
inds_a = mask.A[mask.A].index.values
inds_b = mask.B[mask.B].index.values


for ind in [inds_a, inds_b]:
    for sub in np.split(ind,   np.where(np.diff(ind) != 1)[0]+1):
        if sub.size > 1:
            print("2 appeared at {} to {}".format(sub[0], sub[-1]))

Output:

2 appeared at 6 to 8
2 appeared at 8 to 9

Interestingly I found using itertools.groupby to actually be the fastest:

from itertools import groupby

for k in df:
    ind = prev = 0
    for k, v in groupby(df[k], key=lambda x: x == 2):
        ind += sum(1 for _ in v)
        if k and prev + 1 != ind:
            print("2 appeared at {} to {}".format(prev, ind - 1))
        prev = ind

Output:

2 appeared at 6 to 8
2 appeared at 8 to 9
Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321