1

I would like to compute spell lengths based on the equality of adjacent columns in a pandas DataFrame. What is the best way to do this?

An example:

import pandas as pd
d1 = pd.DataFrame([['4', '4', '4', '5'], ['23', '23', '24', '24'], ['112', '112', '112', '112']], 
              index=['c1', 'c2', 'c3'], columns=[1962, 1963, 1964, 1965])

produces a dataframe that looks like:

    1962  1963  1964  1965
c1     4     4     4     5
c2    23    23    24    24
c3   112   112   112   112

I would like to return a dataframe that records the spell number of each cell, row by row. In this case c1 has 2 spells: the first runs from 1962 to 1964, and the second starts and finishes in 1965.

[image: desired dataframe of spell numbers]

I would also like a dataframe that gives the length of the spell each cell belongs to. For example, c1 has one spell of 3 years and a second spell of 1 year.

[image: desired dataframe of spell lengths]

This re-coding is useful in survival analysis.

sanguineturtle
I've read your question multiple times and I don't understand what you are asking or what the desired output is. Can you explain a bit more clearly, with examples? – EdChum Aug 06 '14 at 07:09

2 Answers

1

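The following works for your dataset. I needed to ask a question of my own in order to reduce my original answer to list comprehensions and itertools. Note that both functions ignore position within the row: unique() and value_counts() treat repeated values as one spell even when they are not adjacent, and value_counts() orders counts by frequency, so a row whose shorter spell comes first would be mislabeled.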

In [153]:

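def num_spells(x):
    # Distinct values in order of first appearance; position + 1 is the spell number
    t = list(x.unique())
    # Return a Series so that apply() builds a DataFrame with the original columns
    return pd.Series([t.index(el)+1 for el in x], index=x.index)

d1.apply(num_spells, axis=1)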

Out[153]:
    1962  1963  1964  1965
c1     1     1     1     2
c2     1     1     2     2
c3     1     1     1     1
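
If a value can come back later in a row (for example 4, 4, 5, 4), numbering by unique() would put the last cell back into spell 1. A minimal sketch of a run-based alternative follows, assuming a spell means a maximal run of equal adjacent values. The name spell_number is hypothetical and not part of the answer above, and missing values are not handled here.

```python
import pandas as pd

def spell_number(row):
    """Number each run of equal adjacent values in a row: 1, 2, 3, ..."""
    # True wherever a cell differs from its left neighbour.
    # The first cell is compared with NaN, so it always starts a spell.
    starts = row != row.shift()
    # The running count of spell starts is the spell number.
    return starts.cumsum()

d1 = pd.DataFrame([['4', '4', '4', '5'], ['23', '23', '24', '24'], ['112', '112', '112', '112']],
                  index=['c1', 'c2', 'c3'], columns=[1962, 1963, 1964, 1965])

# Row-wise application gives the same values as Out[153] for this dataset.
spells = d1.apply(spell_number, axis=1)

# A value that returns after a gap is counted as a new spell.
recurring = spell_number(pd.Series(['4', '4', '5', '4']))
```

Comparing each cell with its shifted neighbour keeps the logic positional, so the result does not depend on which values happen to repeat.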

In [144]:
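from itertools import chain, repeat
def spell_len(x):
    # value_counts() returns the count of each value, most frequent first
    t = list(x.value_counts())
    # Repeat each count i exactly i times, e.g. [3, 1] -> [3, 3, 3, 1]
    lengths = list(chain.from_iterable(repeat(i,i) for i in t))
    # Return a Series so that apply() builds a DataFrame with the original columns
    return pd.Series(lengths, index=x.index)

d1.apply(spell_len, axis=1)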
Out[144]:
    1962  1963  1964  1965
c1     3     3     3     1
c2     2     2     2     2
c3     4     4     4     4
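
Because value_counts() orders counts by frequency rather than by position, a row such as 4, 5, 5, 5 would come out as 3, 3, 3, 1. A sketch that measures each run where it actually sits is shown below, again assuming a spell is a maximal run of equal adjacent values. The name spell_length and the short_first row are hypothetical illustrations.

```python
import pandas as pd

def spell_length(row):
    """Give each cell the length of the run of equal adjacent values it belongs to."""
    # Label the runs 1, 2, 3, ... by counting the positions where the value changes.
    run_id = (row != row.shift()).cumsum()
    # Count the cells in each run, then look the count up for every cell.
    return run_id.map(run_id.value_counts())

d1 = pd.DataFrame([['4', '4', '4', '5'], ['23', '23', '24', '24'], ['112', '112', '112', '112']],
                  index=['c1', 'c2', 'c3'], columns=[1962, 1963, 1964, 1965])

# Row-wise application gives the same values as Out[144] for this dataset.
lengths = d1.apply(spell_length, axis=1)

# A short spell followed by a longer one keeps its position.
short_first = spell_length(pd.Series(['4', '5', '5', '5']))
```

Here value_counts() is applied to the run labels rather than to the raw values, so its ordering no longer matters: map() looks each label up by index.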
EdChum
0

I've updated the num_spells function suggested by @EdChum so that it handles np.nan values:

import numpy as np
import pandas as pd

def compute_number_of_spells(wide_df):
    """
    Compute Number of Spells in a Wide DataFrame for Each Row
    Columns : Time Data
    """
    def num_spells(x):
        """ Compute the spells in each row """
        t = list(x.dropna().unique())
        r = []
        for el in x:
            if pd.notnull(el):              # Unlike np.isnan, also works for strings
                r.append(t.index(el)+1)
            else:
                r.append(np.nan)            # Handle np.nan case
        # Return a Series so that apply() keeps the original columns
        return pd.Series(r, index=x.index)
    return wide_df.apply(num_spells, axis=1)
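
The same missing-value handling can be combined with run detection. The sketch below assumes that a missing cell belongs to no spell and that any observed value following a gap starts a new spell, which differs from the unique()-based numbering above. The function name and the d2 data are hypothetical.

```python
import numpy as np
import pandas as pd

def spell_number_with_gaps(row):
    """Number runs of equal adjacent values, leaving missing cells as NaN."""
    observed = row.notna()
    # A spell starts at an observed cell that differs from its left neighbour.
    starts = (row != row.shift()) & observed
    # Count the starts, then blank out the positions that were missing.
    return starts.cumsum().where(observed)

# Hypothetical data with gaps, laid out like d1.
d2 = pd.DataFrame([['4', '4', np.nan, '5'], [np.nan, '23', '23', '24']],
                  index=['c1', 'c2'], columns=[1962, 1963, 1964, 1965])

spells = d2.apply(spell_number_with_gaps, axis=1)
```

The result holds floats rather than integers, because NaN cannot be stored in an integer column.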
sanguineturtle