Suppose I have a pandas DataFrame that shows the name of the business unit leader for different units over time. It could look something like this:
      Leader_Jan Leader_Feb Leader_Mar Leader_Apr
Unit1       Nina       Nina       Nina       Nina
Unit2       Lena       Lena        NaN       Lena
Unit3       Alex      Maria       Alex       Alex
Unit4     Emilia        NaN        NaN        NaN
Unit5        NaN    Corinna      Petra        NaN
It can be recreated like this:
import pandas as pd
import numpy as np

# np.nan is used instead of np.NaN, because the np.NaN alias was removed in NumPy 2.0
a = ['Nina', 'Nina', 'Nina', 'Nina']
b = ['Lena', 'Lena', np.nan, 'Lena']
c = ['Alex', 'Maria', 'Alex', 'Alex']
d = ['Emilia', np.nan, np.nan, np.nan]
e = [np.nan, 'Corinna', 'Petra', np.nan]
data = pd.DataFrame(
    data=[a, b, c, d, e],
    columns=['Leader_Jan', 'Leader_Feb', 'Leader_Mar', 'Leader_Apr'],
    index=['Unit1', 'Unit2', 'Unit3', 'Unit4', 'Unit5'],
)
Context: I would like to find out in which units leaders stay for very short or very long periods (in months), so that I can later investigate whether there are team conflicts in specific units of my company.
I would like to add two columns to the DataFrame: the minimum and the maximum length (in months), per unit, of any single uninterrupted period that a leader has been there. Because of possible interruptions (see Unit2 and Unit3), I cannot just use value_counts on the names in each row. Instead, I need the lengths of the sequences of identical consecutive leader names, where a sequence ends at a NaN value or at a different name. To see what I consider a sequence, check the different colors in this photo:
As you can see, an interruption as in Unit2 and Unit3 should result in multiple lengths of stay. NaN months should not be counted toward any length of stay.
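To make the idea of a sequence concrete, here is a minimal sketch of the run-length logic described above. It assumes Python's itertools.groupby for grouping consecutive equal names. The helper name stay_lengths and the variable leader_cols are hypothetical, and returning 0 for a unit that never had a leader is my own assumption rather than a requirement.

```python
from itertools import groupby

import numpy as np
import pandas as pd


def stay_lengths(row):
    """Return the lengths of all uninterrupted runs of the same name, skipping NaN."""
    # Map NaN to None so that missing months compare equal to each other
    # and can be dropped as a group.
    keys = [name if pd.notna(name) else None for name in row]
    # groupby only groups *consecutive* equal values, which is exactly
    # what "uninterrupted" means here.
    return [len(list(group)) for key, group in groupby(keys) if key is not None]


data = pd.DataFrame(
    data=[
        ['Nina', 'Nina', 'Nina', 'Nina'],
        ['Lena', 'Lena', np.nan, 'Lena'],
        ['Alex', 'Maria', 'Alex', 'Alex'],
        ['Emilia', np.nan, np.nan, np.nan],
        [np.nan, 'Corinna', 'Petra', np.nan],
    ],
    columns=['Leader_Jan', 'Leader_Feb', 'Leader_Mar', 'Leader_Apr'],
    index=['Unit1', 'Unit2', 'Unit3', 'Unit4', 'Unit5'],
)

# Remember the month columns before adding the result columns.
leader_cols = list(data.columns)
runs = [stay_lengths(row) for row in data[leader_cols].to_numpy()]

# default=0 is an assumption for units whose row contains only NaN.
data['Min_length_of_stay_leaders'] = [min(r, default=0) for r in runs]
data['Max_length_of_stay_leaders'] = [max(r, default=0) for r in runs]
```

For Unit2, for example, stay_lengths yields the run lengths [2, 1], whereas value_counts would report Lena three times and hide the interruption.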
The result should look like:
      Leader_Jan Leader_Feb Leader_Mar Leader_Apr  Min_length_of_stay_leaders  Max_length_of_stay_leaders
Unit1       Nina       Nina       Nina       Nina                           4                           4
Unit2       Lena       Lena        NaN       Lena                           1                           2
Unit3       Alex      Maria       Alex       Alex                           1                           2
Unit4     Emilia        NaN        NaN        NaN                           1                           1
Unit5        NaN    Corinna      Petra        NaN                           1                           1
I know this might be rather complicated to follow, but I would appreciate any help or tips, because I am a little lost here.