0

I have a dataset of parking transactions with entry and exit date timestamps and I am interested in finding the occupancy of the parking building for whenever a transaction takes place.

Sample dataset:

# Stack overflow sample
data = {'ENTRY DATE':['10/1/2022 12:00:09 AM',
                      '10/1/2022 12:01:49 AM',
                      '10/1/2022 12:02:54 AM',
                      '10/1/2022 12:12:30 AM',
                      '10/1/2022 12:15:24 AM'],
        'EXIT DATE': ['10/1/2022 12:22:20 AM',
                      '10/1/2022  12:06:13 AM',
                      '10/1/2022  12:22:19 AM',
                      '10/1/2022  12:23:35 AM',
                      '10/1/2022  12:22:32 AM'],
        'TYPE': ['Short term parking',
                 'Short term parking',
                 'Short term parking',
                 'Short term parking',
                 'Short term parking']
       }

# Create dataframe
df = pd.DataFrame(data)

# Change data type
df['ENTRY DATE'] = pd.to_datetime(df['ENTRY DATE'])
df['EXIT DATE'] = pd.to_datetime(df['EXIT DATE'])

My current solution is creating the following query but it is extremely slow for a dataset with hundreds of thousands of rows:

def get_occ(df):
    count_list = []
    for exit_date in df['EXIT DATE']:
        # Perform Query, append count to list
        count = df.query("@exit_date > `ENTRY DATE` & @exit_date <= `EXIT DATE`" )['TYPE'].count()
        count_list.append(count)
    # Add counts to df
    df['OCCUPANCY'] = count_list

A sampled desired output:

    ENTRY DATE          EXIT DATE           TYPE                OCCUPANCY
0   2022-10-01 00:00:09 2022-10-01 00:22:20 Short term parking  3
1   2022-10-01 00:01:49 2022-10-01 00:06:13 Short term parking  3
2   2022-10-01 00:02:54 2022-10-01 00:22:19 Short term parking  4
3   2022-10-01 00:12:30 2022-10-01 00:23:35 Short term parking  1
4   2022-10-01 00:15:24 2022-10-01 00:22:32 Short term parking  2

Any suggestions or alternatives for how I can retrieve the result I am looking for?

Oz123
  • 11
  • 1
  • 1
    Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. – itprorh66 Feb 01 '23 at 18:22
  • @itprorh66 thank you for the suggestion, I will revise the question with sample input and a desired output. – Oz123 Feb 01 '23 at 18:46

1 Answers1

0

We wish to report number of cars in a garage when a given car has just entered.

The key is that you need to be able to rapidly assess how many exits occurred prior to that entrance event.

Here is one approach, which has O(n^2) quadratic performance.

df["entry"] = pd.to_datetime(df.entry)
df["exit"] = pd.to_datetime(df.exit)


def get_occupancies(df):
    df = df.set_index("exit")  # for faster queries
    num_entries = 0
    for _, row in df.sort_values("entry").iterrows():
        num_entries += 1
        num_exits = df.query("exit < @row.entry").shape[0]
        yield num_entries - num_exits, row.entry


for occ, entry in get_occupancies(df):
    print(f"{occ:5d}   {entry}")

output:

    1   2022-10-01 00:00:09
    2   2022-10-01 00:01:49
    3   2022-10-01 00:02:54
    3   2022-10-01 00:12:30
    4   2022-10-01 00:15:24

A variant approach would be to create a summary report of car counts at regular intervals. Then queries would only have quadratic badness within a day or an hour: begin with a count that is prior to entry date, then query for exits which happened between then and now.


Much better to incur O(n log n) sorting cost and look at a total ordering of in / out events:

def get_occupancy_deltas(df):
    for _, row in df.iterrows():
        assert row.entry < row.exit
        yield row.entry, 1
        yield row.exit, -1


def get_occupancies_quickly(df):
    occ = 0
    for stamp, delta in sorted(get_occupancy_deltas(df)):
        occ += delta
        if delta == 1:
            yield occ, stamp

Output continues to be:

    1   2022-10-01 00:00:09
    2   2022-10-01 00:01:49
    3   2022-10-01 00:02:54
    3   2022-10-01 00:12:30
    4   2022-10-01 00:15:24

Adjust the if to view exit events, or remove it to view all events together.

Use df.merge() to add a df.occupancy column, if you like. Additionally yielding the index would make sense, in case there's any duplicated timestamps.

J_H
  • 17,926
  • 4
  • 24
  • 44