1

I have a DataFrame like below, it's a transaction log:

c_id t_type unit  effective_date
1    enter  50     2/5/2020
2    exit   25     2/5/2020
1    exit   50     2/5/2020
2    enter  100    2/8/2020
3    enter  10     2/8/2020
1    enter  17     2/8/2020
3    exit   25     2/8/2020
3    exit   25     2/12/2020

I have a definition of active customer: a c_id who is entered units > exited_units.

For each day , I want to know how many active customers exists from beginning of the log.

Simplest way occur to mind is use for loop to consider each date as most recent day and get the count of active customers from beginning to that day, but I wanted more optimized scalable Pythonic/pandas way to solve this problem.

edit: My desired out put will be something like this:

effective_date  active_count
  2/5/2020           ?
  2/8/2020           ?
  2/12/2020          ?
  2/13/2020          ?
  2/14/2020          ?
  2/19/2020          ?
  2/20/2020          ?

replace question marks with active customer count of that day from beginning of the log. real issue here is For each day, we need to calculate the count from beginning of the log.

Ali Crash
  • 458
  • 2
  • 5
  • 15
  • [https://stackoverflow.com/questions/67450449/manipulate-count-data-in-python](https://stackoverflow.com/questions/67450449/manipulate-count-data-in-python) Check out this link – Kabilan Mohanraj May 08 '21 at 19:05

1 Answers1

1

I would start by using a pivot table to get the data in an easier way:

pivot = df.pivot_table(index=['c_id','effective_date'],columns='t_type',values='unit',aggfunc='sum',fill_value=0)

This gives us a format that will make it a lot easier to compare Exit and Enter values in a daily basis:

t_type               Enter  Exit
c_id effective_date             
1    2/5/2020           50    50
     2/8/2020           17     0
2    2/5/2020            0    25
     2/8/2020          100     0
3    2/12/2020           0    25
     2/8/2020           10    25

We use np.where() to perform this comparison:

pivot['Active'] = np.where(pivot['Enter'] > pivot['Exit'],1,0)

Returning:

t_type               Enter  Exit  Active
c_id effective_date                     
1    2/5/2020           50    50       0
     2/8/2020           17     0       1
2    2/5/2020            0    25       0
     2/8/2020          100     0       1
3    2/12/2020           0    25       0
     2/8/2020           10    25       0

Finally, we can a perform a groupby() by c_id and sum the Active column to get the total days in which a customer's Enter is larger than Exit:

output = pivot.reset_index().groupby('c_id')['Active'].xumsum()

Returning for the example data you provide:

c_id
1    1
2    1
3    0

EDIT: Based on the edit in the question, output should be:

output = pivot.reset_index().groupby('effective_date')['Active'].sum()
Celius Stingher
  • 17,835
  • 6
  • 23
  • 53
  • thank for your answer, but not exactly what I wanted, I updated the question, I need count for each day. calculate count for each day from beginning of the log. I explained in question as update. – Ali Crash May 08 '21 at 17:26
  • Ok, it's the same tho, groupby date then – Celius Stingher May 08 '21 at 17:27
  • your update counts actives in that specific day. I wanted to calculate count from beginning of the log to each day. – Ali Crash May 08 '21 at 17:35
  • sort of cummulative yes, example: day1:2active, day2: 5active because 3 added that day, day3:4active because one of active customers gets more exit this day and no new one came, so overall day3 have 4 actives, getting the idea? – Ali Crash May 08 '21 at 17:47