1

Imagine an employee that has worked for different companies and you want to know how many months he worked for each company. The problem is that he may have worked for two companies, overlapping some time (case of companies D and C). This overlapping time shall not be included. Another thing is that in the each month he worked, it counts as one month, no matter how long he worked, for example, in company A, duration shall be 3 months (as he worked in months 1, 2 and 3), although the real duration would be 2 months.

    Company    Start              End       Duration (months)
       A      2021-01-01       2021-03-01
       B      2021-03-03       2021-06-07
       C      2021-06-10       2021-08-28
       D      2021-04-10       2021-10-02

Considering the inputs I received here, I changed the code as below:

import io
import pandas as pd
import numpy as np

import calendar
from datetime import datetime
from dateutil.relativedelta import relativedelta
data = io.StringIO("""
Company,Start,End
A,2021-01-01,2021-03-01
B,2021-03-03,2021-06-07
C,2021-06-10,2021-08-28
D,2021-04-10,2021-10-02
""")

df = pd.read_csv(data)
def to_datetime(date):
    return datetime.strptime(date, "%Y-%m-%d")

df["Start"] = df["Start"].apply(to_datetime)
df["End"] = df["End"].apply(to_datetime)


for row in df:
    overlapping = df.Start<df.End.shift()
    df['Overlapping'] = overlapping
    if not overlapping.empty:
        duplicates = df.End.shift() - df.Start
        df['Duplicates'] = duplicates/np.timedelta64(1, 'M')
    else:
        df['Duplicates'] = 0
 
 print(df)

The problem is that I do not understand why the "Duplicates" are not zero when following else condition. They are shown as negative.

Output:

  Company      Start        End  Overlapping  Duplicates
0       A 2021-01-01 2021-03-01        False         NaN
1       B 2021-03-03 2021-06-07        False   -0.065710
2       C 2021-06-10 2021-08-28        False   -0.098565
3       D 2021-04-10 2021-10-02         True    4.599684

Thanks

qingy2019
  • 536
  • 7
  • 23
Carlos
  • 37
  • 7
  • May be helpful: https://stackoverflow.com/questions/325933/determine-whether-two-date-ranges-overlap – jarmod Oct 14 '21 at 20:36
  • What is your expected dataframe ? What if there are multiple overlaps ? I think you should clarify your problem more. –  Oct 14 '21 at 21:41
  • Thanks jarmod! Interesting stuff. Solution could be coming from there. – Carlos Oct 15 '21 at 21:18
  • Hi tako0707. The expected dataframe would be as shown above, with the column 'Duration' filled with duration in months, excluding overlappings, meaning duration should be counted only once. Multiple overlappings could happen, in the case employee worked in more than one company at the same time (3 is already too much, but it could happen). – Carlos Oct 15 '21 at 21:24

1 Answers1

2

The problem is that he may have worked for two companies, overlapping some time (case of companies D and C). This overlapping time shall not be included.

I read this like: No time is considered while working for 2 or more companies.

The example below follows this requirement. However, later in this answer, there is an example which considers overlapping time at least once.

You could do something like this:

import io
import pandas as pd

from datetime import datetime
# mock data / simulating CSV file
data = io.StringIO("""
Company,Start,End
A,2021-01-01,2021-03-01
B,2021-03-03,2021-06-07
C,2021-06-10,2021-08-28
D,2021-04-10,2021-10-02
""")

df = pd.read_csv(data)
def to_datetime(date):
    return datetime.strptime(date, "%Y-%m-%d")

df["Start"] = df["Start"].apply(to_datetime)
df["End"] = df["End"].apply(to_datetime)
def get_full_months(start, end):
    """ Returns the number of months including
        non-full months. """

    if end.year == start.year:
        return end.month - start.month + 1
    
    months = (end.year - start.year - 1) * 12
    return (12 - start.month + 1) + end.month + months
def evaluate_duration(start, end):
    """ Evaluates the duration taking other employment periods
        into account. Overlapping periods are not considered
        at all. Single days within a month are rounded up to
        a full month. """
    
    left = df[(df["End"] > start) & (df["Start"] < start)]
    right = df[(df["Start"] > start) & (df["End"] > end)]
    overlaps = df[(df["Start"] < start) & (df["End"] > end)]
    within = df[(df["Start"] > start) & (df["End"] < end)]
    
    if not left.empty:
        start = left["End"].max()

    if not right.empty:
        end = right["Start"].min()
    
    if not overlaps.empty:
        # if there are any records which overlaps the current
        # period full, the current period will not be considered
        return 0
    
    if not within.empty:
        # if there are other records within the current period,
        # these months need to be removed since overlapping
        duration = df.apply(
            lambda row: get_full_months(row["Start"], row["End"]),
            axis=1
        )
        return get_full_months(start, end) - duration.sum()
    
    return get_full_months(start, end)
df["Duration"] = df.apply(
    lambda row: evaluate_duration(row["Start"], row["End"]),
    axis=1
)

Output for print(df) would be:

  Company      Start        End  Duration
0       A 2021-01-01 2021-03-01         3
1       B 2021-03-03 2021-06-07         2
2       C 2021-06-10 2021-08-28         0
3       D 2021-04-10 2021-10-02         2

Explanation:

     Considered      Not Considered      Why
A:   Jan to Mrz
B:   Mrz, Apr        May, Jun            Overlapped on the right with D
                                         NB: Mrz and Apr are not (fully) overlapped
C:                   Jun to Aug          Fully overlapped by D
D:   Sep, Oct        Apr to Aug          Overlapped on the left with B, overlaps C

You need to consider 4 different cases of overlapping:

Left                       Right

|-------|                        |------|       df["Start"] and df["End"]
    |***ooooo|             |oooooo***|          start and end
        ^                        ^ 
        left["End"].max()        right["Start"].min()


Overlaps                   Within

|----------|                 |------|           df["Start"] and df["End"]
   |*****|                 |oo******oooooo|     start and end


Legend

|----|   employment period
 oooo    considered duration
 ***     ignored date range within employment period

In one of your comments you wrote:

[...] excluding overlappings, meaning duration should be counted only once. [...]

If you need overlapping time at least once being considered, change the function evaluate_duration to:

def evaluate_duration(start, end):
    
    left = df[(df["End"] > start) & (df["Start"] < start)]
    overlaps = df[(df["Start"] < start) & (df["End"] > end)]
    
    if not left.empty:
        start = left["End"].max()
    
    if not overlaps.empty:
        return 0
    
    return get_full_months(start, end)

Output for print(df) would be:

  Company      Start        End  Duration
0       A 2021-01-01 2021-03-01         3
1       B 2021-03-03 2021-06-07         4
2       C 2021-06-10 2021-08-28         0
3       D 2021-04-10 2021-10-02         5

Explanation:

     Considered      Not Considered      Why
A:   Jan to Mrz
B:   Mrz to Jun                          Note: Mrz does not overlap
C:                   Jun to Aug          Fully overlapped by D, therefore ignored
D:   Jun to Oct      Apr, May            Jun only partially overlaps with B

Another thing is that in the each month he worked, it counts as one month, no matter how long he worked [...]

You can achieve this with:

def get_full_months(start, end):
    """ Returns the number of months including
        non-full months. """

    if end.year == start.year:
        return end.month - start.month + 1
    
    months = (end.year - start.year - 1) * 12
    return (12 - start.month + 1) + end.month + months

Please note: The code above may not be fully tested, especially not with edge cases and other data than the provided example data.

Thomas
  • 8,357
  • 15
  • 45
  • 81
  • Hi Thomas! Thanks for your help. I learnt a lot from your code, but I am far away to follow your logic, sorry. Could you please explain how right and overlaps lines work? By the way, considering handling of 2 overlappings, company reached a wrong duration. – Carlos Oct 15 '21 at 21:14
  • @Carlos, I updated my answer. You need to consider 4 different cases of overlapping. I hope it helps. – Thomas Oct 17 '21 at 15:53
  • what a lesson!!! Thank you very much for taking your time to explain it to me in such a detailed way!!! God bless you! – Carlos Oct 17 '21 at 18:02