
I'm trying to build a dataset for survival analysis from a monthly dataset I already have. My current code in Python using pandas is something like this (I know this code is not perfect yet; I'm asking about the Big O notation here):

import pandas as pd

# months where the churn label is 1: the right edges of the intervals
chrn_months = df.loc[df.label == 1].month
temp_month = 0
df_final = pd.DataFrame()
for month in chrn_months:
    df_temp = df.loc[df.month == month].copy()  # .copy() avoids SettingWithCopyWarning
    df_temp['start'] = temp_month
    df_temp['end'] = month
    df_final = pd.concat([df_final, df_temp])
    temp_month = month
    df = df.loc[df.month >= month]
# censored tail: no churn label in the last observed month (16)
if list(chrn_months)[-1] != 16:
    df_temp['start'] = df_final.end.max()
    df_temp['end'] = 16
    df_final = pd.concat([df_final, df_temp])

[image: left is the monthly dataset, right is the desired dataset]

I have fixed almost all the glitches in the code above and it works now, but it is too computationally heavy, and I was wondering if there is a better way to do it.

I also need to sum up other columns' values over the month intervals, so please keep that in mind.

As you can see, I'm using a for loop to create the desired output for each ID, and we need another (nested) loop to filter each ID's data out of the giant data frame, which leaves us with O(n^2); that is problematic for me.

When I run the code, it freezes after many hours because of the huge amount of data involved.

Trying to be clearer here:

I want to get from the left dataset to the right one for many different IDs, each of which needs to be handled separately. (status is set to 0 only if there are no 1 labels at all during the total 16 months, or if there are none from some month onward (in this example, after month 12) up to the 16th, which is the last observation: the so-called "censoring" in survival analysis.)
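In case the image does not come through, here is a small made-up example of what I mean (one ID shown; the numbers are invented for illustration):

Left (monthly):

id      month  label
123456  1      0
123456  2      1
123456  3      0
123456  4      0
123456  5      1
...
123456  16     0

Right (desired; the last interval is censored since no label 1 occurs after month 5):

id      start  end  status
123456  0      2    1
123456  2      5    1
123456  5      16   0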

r3z8
  • Providing data instead of images of it helps you get recommendations from the community much faster; it is one of the reasons you got a downvote. – eng Aug 06 '23 at 15:25
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Aug 06 '23 at 20:25
  • Sorry, I'm pretty new to asking questions here; usually I just use other people's. I want to sum certain features over the intervals I'm creating, to see what happens between the times a label turns 1. – r3z8 Aug 07 '23 at 05:38
  • At first glance it looks like you need some windowing, kind of `df.loc[df['label']==1, 'month'].rolling(2)`. – Vitalizzare Aug 07 '23 at 06:25
  • Is there a reason not to convert `df['label']` to a boolean? – Vitalizzare Aug 07 '23 at 06:27
  • Yeah, I think windowing will do; the right way to do it is my challenge. For the label part, boolean works. – r3z8 Aug 07 '23 at 06:30

2 Answers


Using .query(), .assign(), .shift():

Data:

import pandas as pd


data = {
    "id": [123456, 123456, 123456, 123456, 123456, 123456, 123456, 123456,
           123456, 123456, 123456, 123456, 123456, 123456, 123456, 123456],
    "month": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
    "label": [1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
}
df = pd.DataFrame(data=data)

Code:

# keep only the rows where the label is 1 (the interval right edges)
df = df.query(expr="label.eq(1)")
df = df.assign(
    # the previous churn month becomes the interval start (0 for the first one)
    start=df["month"].shift(1).fillna(0).astype(int),
    end=df["month"].shift(0)
).drop(columns=["month", "label"]).reset_index(drop=True)
print(df)

Output:

       id  start  end
0  123456      0    1
1  123456      1    2
2  123456      2    4
3  123456      4    5
4  123456      5    9
5  123456      9   12
6  123456     12   16
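
This works for a single ID. The same shift() idea could be extended to many IDs with a per-group shift; the following is a sketch only, going beyond the original answer, and assumes the frame is sorted by id and month:

out = (
    df.query("label.eq(1)")  # keep only the churn rows
    .assign(
        # previous churn month within the same id (0 for the first interval)
        start=lambda d: d.groupby("id")["month"].shift(1).fillna(0).astype(int),
        end=lambda d: d["month"],
    )
    .drop(columns=["month", "label"])
    .reset_index(drop=True)
)

The per-id groupby keeps intervals from crossing user boundaries; summing additional columns over the intervals needs a groupby-based aggregation like the one in the other answer.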
Jason Baker
  • Why `shift(0)`? – Vitalizzare Aug 06 '23 at 16:16
  • it's redundant but explicit – Jason Baker Aug 06 '23 at 16:17
  • Thanks for the answer. This doesn't take into account that there is more than one "ID" and we need to do this for all of them in a big dataset. Also, what if I need to sum up another existing column, like the number of transactions per month, over each interval and show it as a separate column? Do I need to change the whole thing, or is it possible in the same code? – r3z8 Aug 07 '23 at 06:04

Case of data sorted by month

If I understand you correctly, you're trying to do something like this:

import pandas as pd
from IPython.display import display  # built into Jupyter; import explicitly elsewhere


data = {
    "id": [123456]*16,
    "month": range(1,17),
    "transactions": [0, 0, 2, 0, 3, 1, 4, 0, 6, 5, 7, 5, 0, 2, 3, 7],
    "label": [1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
}
df = pd.DataFrame(data=data)


def collect_transactions(df):
    groups = df.groupby(df['label'].cumsum().shift(fill_value=0))

    res = groups.agg({
        'month': ['first', 'last']
        , 'transactions': 'sum'
    })
    
    res[('month','first')] -= 1
    res['status'] = 1
    res.loc[len(res)-1, 'status'] = df['label'].iloc[-1]
    
    return res

# collect data for all users
all_transactions = df.groupby('id').apply(collect_transactions)

# see how it goes for each user separately
for user_id, user_group in df.groupby('id'):
    display(collect_transactions(user_group).style.set_caption(f'ID: {user_id}'))

Let's look at the data and the code in detail. I assume that:

  • data are sorted by month
  • df['label'] marks the right edges of the month intervals
  • the left edges of the intervals are open
  • df['transactions'] are summed over the intervals

So we have to split the data into groups by these intervals. After that, for each group we take the sum of transactions as well as the first and last months to mark the interval edges.

# when marking intervals with the cumulative sum we have to shift the result
# in order to include each record with label == 1
# in the previously marked group as its right edge
intervals = df['label'].cumsum().shift(fill_value=0)
groups = df.groupby(intervals)

# when aggregating over the groups we can pass as an argument
# a dictionary with columns of the original data as keys,
# whose values are a function or a list of functions
# to run on the corresponding column
res = groups.agg({
    'month': ['first', 'last']
    , 'transactions': 'sum'
})

# correct the first item in each group
# to make it an open left edge of the interval
res[('month','first')] -= 1

# mark whether the last interval contains any label equal to 1,
# which happens only if the last record has label == 1,
# so we just copy the last label value into the final status
res['status'] = 1
res.loc[len(res)-1, 'status'] = df['label'].iloc[-1]
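
For the sample data above, the aggregated frame should come out roughly like this (worked out by hand from the code; the exact printed layout of the MultiIndex columns may differ):

       month      transactions  status
       first last          sum
label
0          0    1            0       1
1          1    2            0       1
2          2    4            2       1
3          4    5            3       1
4          5    9           11       1
5          9   12           17       1
6         12   16           12       0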

What if data are not sorted by month?

If the data are not sorted by month, then we can either sort them before these operations, or use pd.cut(df.month, ...) to create bins to group by. Let's see if the latter option can be useful.

# make sure we can use labels in `.loc[labels, ...]` operations
# (skip this if labels are boolean)
labels = df.label.astype(bool)   

# need inf to cover the last month;
# later we can use it instead of status to see
# whether the labels ended with 1 (see the example below)
inf = float('inf')

# create a sequence of the month right edges to cut data in bins;
# include 0 and inf to cover first and last months
month_bins = sorted([0, inf, *df.loc[labels, 'month']])
grouper = pd.cut(df['month'], month_bins)

# when we group by intervals, they are automatically used
# as the index of the groups, and the output of the aggregation
# is sorted by that index, so there's no need for explicit start and end points
res = (
    df['transactions']
    .groupby(grouper)
    .sum(min_count=1)    # preserve nan in the presence of (max(df.month), inf]
    .dropna()            # remove (max(df.month), inf] if exists  
    .rename('transactions')
)

Let's say we have the data:

data = {
    "id": [123456]*16,
    "month": range(16,0,-1),
    'transactions': reversed([0, 0, 2, 0, 3, 1, 4, 0, 6, 5, 7, 5, 0, 2, 3, 7]),
    "label": reversed([1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0])
}
df = pd.DataFrame(data=data)

With this data we get the output:

month
(0.0, 1.0]      0
(1.0, 2.0]      0
(2.0, 4.0]      2
(4.0, 5.0]      3
(5.0, 9.0]     11
(9.0, 12.0]    17
(12.0, inf]    12
Name: transactions, dtype: int64

Here (..., inf] is the same as status == 0
and (..., max(df.month)] as status == 1.
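
If explicit start/end/status columns are needed rather than the interval index, they can be unpacked from the pd.Interval objects. A minimal sketch (not part of the original answer; it assumes the res produced above):

# unpack the interval index into explicit columns
out = res.reset_index()
out['start'] = out['month'].map(lambda iv: iv.left)
out['end'] = out['month'].map(lambda iv: iv.right)
# an interval ending at inf means the labels did not end with 1, i.e. censored
out['status'] = (out['end'] != float('inf')).astype(int)
out = out.drop(columns=['month'])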

General case

The assumption here is that months are not duplicated within each user.

def collect_transactions(df):
    month_bins = sorted([0, float('inf'), *df.loc[df.label, 'month']])
    grouper = pd.cut(df['month'], month_bins)
    res = (
        df['transactions']
        .groupby(grouper)
        .sum(min_count=1)   
        .dropna()           
        .rename('transactions')
    )    
    return res.to_frame()


df['label'] = df['label'].astype(bool)
user_groups = df.groupby('id')

# collect the data for all user ids;
# the returned frame has a multilevel index:
# level_0 - user id, level_1 - month interval
all_transactions = user_groups.apply(collect_transactions)

# for demonstration purposes only
# let's see each group separately
for user_id, user_group in user_groups:
    display(
        collect_transactions(user_group)
        .style
        .set_caption(f'ID: {user_id}')
        .format(precision=0)
    )    

On the following test case:

data = {
    "id": [123]*16 + [456]*17
    , "month": [*range(16,0,-1), *range(1,18)]
    , "transactions": [*reversed([0, 0, 2, 0, 3, 1, 4, 0, 6, 5, 7, 5, 0, 2, 3, 7])] +\
                    [0, 0, 2, 0, 3, 1, 4, 0, 6, 5, 7, 5, 0, 2, 3, 7, 0]
    , "label": [*reversed([1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0])] +\
                    [1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
}
df = pd.DataFrame(data)
df['label'] = df['label'].astype(bool)

we have the output:

[image: test output, two styled tables, one per user id]
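
Worked out by hand from the test data, the two tables should contain approximately the following. Note that for ID 456 the last label is 1 (at month 17), so there is no (..., inf] interval:

ID: 123
month         transactions
(0.0, 1.0]               0
(1.0, 2.0]               0
(2.0, 4.0]               2
(4.0, 5.0]               3
(5.0, 9.0]              11
(9.0, 12.0]             17
(12.0, inf]             12

ID: 456
month         transactions
(0.0, 1.0]               0
(1.0, 2.0]               0
(2.0, 4.0]               2
(4.0, 5.0]               3
(5.0, 9.0]              11
(9.0, 12.0]             17
(12.0, 17.0]            12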

Vitalizzare
  • @r3z8 I'm curious how the general code performs on the big data you have. Let me know the results, ok? – Vitalizzare Aug 07 '23 at 13:00
  • Well, you are a life saver!! I appreciate your thorough answer. So I reckon there is no way to do this without using a for loop to filter for each specific ID, right? I'm concatenating in each iteration onto a data frame I initialize at first. Sure, I'll let you know when I run it over the whole data. – r3z8 Aug 08 '23 at 10:15
  • @r3z8 You can filter by `id` like `selected_id=[...]; df.loc[df['id'].isin(selected_id)]` – Vitalizzare Aug 08 '23 at 10:20
  • I'm doing that now, my question is whether I can avoid the for loop or not, which the answer is no, right? – r3z8 Aug 08 '23 at 10:25
  • I used the for-loop for demonstration only. All the data are accumulated in `all_transactions`. I guess you're asking about that, no? – Vitalizzare Aug 08 '23 at 10:27
  • See new comments in the **General case** – Vitalizzare Aug 08 '23 at 10:36
  • Using that I got the error TypeError: unhashable type: 'Series'. So I went ahead and used it for each individual id separately. – r3z8 Aug 08 '23 at 10:59
  • @r3z8 Have you tried to use any pandas.Series as a key in a dictionary? – Vitalizzare Aug 08 '23 at 11:02
  • no, could you elaborate please? – r3z8 Aug 08 '23 at 11:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254845/discussion-between-vitalizzare-and-r3z8). – Vitalizzare Aug 08 '23 at 11:08
  • I ran the code on the whole data; it took about 20 hours for about 800k IDs, so the whole data (5M IDs) could take days. If I understood you correctly, there is no way to get rid of this for loop, right? `for user_id, user_group in df.groupby('id'):` – r3z8 Aug 09 '23 at 08:49
  • @r3z8 Several ways to proceed. 1) Try to reduce the use of Pandas; see [this article](https://stackoverflow.com/a/76864270/14909621) for inspiration. 2) Try [Code Review](https://codereview.stackexchange.com/), asking how the script can be optimized, and provide more info about your data there. Someone may notice features of the data that can be used to write faster code. In general, when we are talking about performance, a Pandas prototype is often ported to NumPy + Numba – Vitalizzare Aug 09 '23 at 09:09
  • About your question on the code's performance on the whole data: it does not work so well; I'm trying to chunk it into 500k IDs and do it that way. – r3z8 Aug 14 '23 at 07:17