Pandas DataFrame: amount of same values in different columns in sequence for each row

Question

Let's assume I have a pandas DataFrame in Python which shows the name of the business unit leader for different units over time. It could look something like this:

          Leader_Jan Leader_Feb Leader_Mar Leader_Apr
Unit1       Nina       Nina       Nina       Nina
Unit2       Lena       Lena        NaN       Lena
Unit3       Alex      Maria       Alex       Alex
Unit4     Emilia        NaN        NaN        NaN
Unit5        NaN    Corinna      Petra        NaN

and can be recreated like:

import pandas as pd
import numpy as np
a = ['Nina','Nina','Nina','Nina']
b = ['Lena','Lena',np.nan,'Lena']
c = ['Alex','Maria','Alex','Alex']
d = ['Emilia',np.nan,np.nan,np.nan]
e = [np.nan,'Corinna','Petra',np.nan]
data = pd.DataFrame(data=[a,b,c,d,e], columns =['Leader_Jan','Leader_Feb','Leader_Mar','Leader_Apr'], index=['Unit1','Unit2','Unit3','Unit4','Unit5'])

Context: I would like to find out in which units leaders stay for very short or very long periods (in months) to later find out whether there are team conflicts in specific units of my company.

I would like to add to the dataframe the minimum and maximum length (in months) per unit of a leader's uninterrupted stay. Because of possible interruptions (see Unit 2 and Unit 3), I cannot just use value_counts on the names in each row. Instead, I need the lengths of the runs of identical leader names, where a run ends at a NaN value or at a different name. To see what I consider a sequence, check the different colors in this photo:

As you might see, an interruption as in Unit 2 and 3 should result in multiple lengths of stay. NaN months should not be counted as part of any sequence.

The result should look like:

      Leader_Jan Leader_Feb Leader_Mar Leader_Apr  Min_length_of_stay_leaders  \
Unit1       Nina       Nina       Nina       Nina                           4   
Unit2       Lena       Lena        NaN       Lena                           1   
Unit3       Alex      Maria       Alex       Alex                           1   
Unit4     Emilia        NaN        NaN        NaN                           1   
Unit5        NaN    Corinna      Petra        NaN                           1   

       Max_length_of_stay_leaders  
Unit1                           4  
Unit2                           2  
Unit3                           2  
Unit4                           1  
Unit5                           1

I know this might be rather complicated to follow, but I would love any kind of help/tips, etc. because I am a little lost here.

RunOrVeith · Answer 1 · 2018-11-13T13:36:55.703

This is actually quite easy using itertools.groupby:

from itertools import groupby

def min_max_durations(row):
    # the group object consumes the iterator, but we don't care about the values 
    # so we just sum "1" to get the length.
    # Taken from https://stackoverflow.com/questions/44490079/how-to-turn-an-itertools-grouper-object-into-a-list
    durations = [sum(1 for _ in group) for key, group in groupby(row) if not isinstance(key, float)]
    return min(durations), max(durations)

data["min_lengths_of_stay"], data["max_lengths_of_stay"] = zip(*data.apply(min_max_durations, axis=1))

The isinstance check for float is just a quick way to exclude the NaN values from the count; you can make this check as elaborate as you need.
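One way to make the NaN check more explicit is to test the group key with pd.isna instead of checking for float. The following is only a sketch of that variation, not the answer's original code. Returning NaN for a row that contains no leader at all is an assumption added here, since min() and max() would otherwise raise on an empty list.

```python
from itertools import groupby

import numpy as np
import pandas as pd


def min_max_durations(row):
    # Length of every run of identical, non-missing names in the row.
    durations = [
        sum(1 for _ in group)
        for key, group in groupby(row)
        if not pd.isna(key)
    ]
    # Assumption: a row without any leader yields NaN for both values.
    if not durations:
        return np.nan, np.nan
    return min(durations), max(durations)


row = pd.Series(["Lena", "Lena", np.nan, "Lena"])
print(min_max_durations(row))  # (1, 2)
```

The function can be applied to the DataFrame in the same way as the version above, with data.apply(min_max_durations, axis=1).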

This outputs the correct results (note that your reproduction code, when copied and pasted, has 3 consecutive "Alex" entries in Unit3, unlike your example):

      Leader_Jan Leader_Feb Leader_Mar Leader_Apr  min_lengths_of_stay  \
Unit1       Nina       Nina       Nina       Nina                    4   
Unit2       Lena       Lena        NaN       Lena                    1   
Unit3      Maria       Alex       Alex       Alex                    1   
Unit4     Emilia        NaN        NaN        NaN                    1   
Unit5        NaN    Corinna      Petra        NaN                    1   

       max_lengths_of_stay  
Unit1                    4  
Unit2                    2  
Unit3                    3  
Unit4                    1  
Unit5                    1

Thanks, you are right about the reproduction code vs. the picture, I have corrected it! — monart, Nov 16 '18 at 10:37

score 1 · Answer 2 · answered Nov 13 '18 at 13:24

This should get you started (df here is the DataFrame that the question calls data) -

temp = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount()+1, axis=1)

mins = temp.min(1)
maxs = temp.max(1)
mask = temp.apply(lambda x: x.is_monotonic_increasing and x.is_unique, axis=1)
mins.loc[mask] = maxs.loc[mask]
mins.name='Min_length_of_stay_leaders'
maxs.name='Max_length_of_stay_leaders'

df.join(mins).join(maxs)

Output

      Leader_Jan Leader_Feb Leader_Mar Leader_Apr  Min_length_of_stay_leaders  \
Unit1       Nina       Nina       Nina       Nina                           4   
Unit2       Lena       Lena        NaN       Lena                           1   
Unit3       Alex      Maria       Alex       Alex                           1   
Unit4     Emilia        NaN        NaN        NaN                           1   
Unit5        NaN    Corinna      Petra        NaN                           1   

       Max_length_of_stay_leaders  
Unit1                           4  
Unit2                           2  
Unit3                           2  
Unit4                           1  
Unit5                           1

Explanation

temp = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount()+1, axis=1)

This gives you a consecutive count of leaders grouped by their name -

       Leader_Jan  Leader_Feb  Leader_Mar  Leader_Apr
Unit1           1           2           3           4
Unit2           1           2           1           1
Unit3           1           1           1           2
Unit4           1           1           1           1
Unit5           1           1           1           1
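To see where these numbers come from, it may help to break the one-liner into its steps for a single row. The sketch below uses the Unit3 row from the question; the intermediate names starts, run_id and position are only illustrative and do not appear in the answer.

```python
import pandas as pd

row = pd.Series(
    ["Alex", "Maria", "Alex", "Alex"],
    index=["Leader_Jan", "Leader_Feb", "Leader_Mar", "Leader_Apr"],
)

# True wherever the name differs from the previous month (the first month
# is compared with NaN, so it always starts a new run).
starts = row != row.shift()

# The cumulative sum turns those flags into one label per run.
run_id = starts.cumsum()

# Position of each month inside its run, counted from 1.
position = row.groupby(run_id).cumcount() + 1

print(run_id.tolist())    # [1, 2, 3, 3]
print(position.tolist())  # [1, 1, 1, 2]
```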

Simply extract the max and min -

mins = temp.min(1)
maxs = temp.max(1)

The problem then comes for Nina - her tenure lasted the whole period, so in that case the min needs to be 4 too.

So just for that edge case, the mask object detects rows whose counts are strictly increasing and replaces the min with the max for those rows.

I am still not sure whether it will work for all cases or not, so please check.
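One case worth checking is a row with two stays of two months each, such as Anna, Anna, Ben, Ben. The running count there is 1, 2, 1, 2, so the min stays at 1 although the shortest stay is 2. A possible variation, sketched below, counts the size of each run instead of the position inside it and drops NaN months before counting. The helper names run_lengths and min_max_stay, as well as the UnitA row, are made up for this illustration.

```python
import numpy as np
import pandas as pd


def run_lengths(row):
    # Label each run of consecutive identical values, as in the answer.
    run_id = (row != row.shift()).cumsum()
    # Drop missing months first so that NaN is never counted as a stay.
    present = row.notna()
    return row[present].groupby(run_id[present]).size()


def min_max_stay(row):
    lengths = run_lengths(row)
    return pd.Series(
        {
            "Min_length_of_stay_leaders": lengths.min(),
            "Max_length_of_stay_leaders": lengths.max(),
        }
    )


months = ["Leader_Jan", "Leader_Feb", "Leader_Mar", "Leader_Apr"]
df = pd.DataFrame(
    [["Anna", "Anna", "Ben", "Ben"], ["Lena", "Lena", np.nan, "Lena"]],
    columns=months,
    index=["UnitA", "Unit2"],  # UnitA is a hypothetical extra row
)

stays = df.apply(min_max_stay, axis=1)
print(stays)
```

With this input, UnitA should come out with a min and max of 2, and Unit2 with a min of 1 and a max of 2. The result can be attached to the original frame with df.join(stays).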
