Generate binary outcome dummy data based on probability of items and its feature

Question

I want to generate synthetic sequence data with a binary outcome (0/1) from scratch. The data has the following properties.

For the sake of an example, let's say there are only 3 items, namely A, B and C. Then:

The data is sequence-based, so items A, B, C occur in an order.
Items A, B, C have features S, T, U, V, X, Y, Z, etc. (these features need to have some effect on generating outcome 1; think of them as feature importance).
The probability of conversion when A, B or C is encountered in the data is user-defined (I want control so that if A occurs in any part of the sequence, the overall probability of conversion to outcome 1 is, say, 2%; more below).
Items can repeat in a sequence, so a sequence can be like C->C->A.

Given the probability of conversion for each item when it occurs in the data (for example, whenever A is encountered in the sequence the probability of outcome 1 is about 2%, when B occurs it is 2.6%, and so on), I want to generate data randomly. The generated data should look something like this:

ID  Sequence  Feature  Outcome
1   A->B      X        0
2   C->C->B   Y        1
3   A->B      X        1
4   A         Z        0
5   A->B->A   Z        0
6   C->C      Y        1

and so on

When generating this data, I want to have control over:

The conversion probability of A, B and C, i.e. defining that when A occurs the probability of conversion is, say, 2%, for B 4% and for C 3.6%.
The number of converted sequences for each sequence length (for example, if the maximum sequence length is 3, I want at least 100000 data points of length 3 having outcome 1).
The number of items I can include (so A, B, C and D, with a sequence length of 4 instead of 3).
The total number of data points, if possible.

Is there a simple way to generate this data while keeping all these parameters in mind?

I'm fairly certain you know exactly what you're asking, but it's not very clear to me as a developer what it is you're asking. To me it looks like a markov chain statistics text got put in a blender with alphabet soup. Please give a [mcve], preferably a small example that starts with actual input data, demonstrates the process, and lay out an expected output. — Daniel F, Apr 26 '22 at 09:29

score 1 · Accepted Answer · answered Apr 27 '22 at 03:39

import itertools
import random

import numpy as np
import pandas as pd

# Items that can appear in a sequence
alphabets = ['A', 'B', 'C']

# Every ordered sequence of length 1..len(alphabets), repetition allowed
combinations = []
for length in range(1, len(alphabets) + 1):
    combinations.extend('->'.join(items)
                        for items in itertools.product(alphabets, repeat=length))

# Random sampling weight per sequence, normalised to sum to 1
weights = np.random.normal(100, 30, len(combinations))
weights = np.clip(weights, 1e-9, None)  # guard against a rare negative draw
weights /= weights.sum()
weights = weights.tolist()
# Alternative: weights = np.random.dirichlet(np.ones(len(combinations)) * 1000.)
# Alternative: uniform random weights, [random.random() for _ in combinations],
# divided by their sum

# Draw one million sequences according to the weights
df = pd.DataFrame(
    random.choices(population=combinations, weights=weights, k=1000000),
    columns=['sequence'])

import matplotlib.pyplot as plt

# Distribution of the sampling weights
plt.hist(weights, bins=20)
plt.show()

# How many times each sequence was actually drawn
distribution = df.groupby('sequence').size().reset_index(name='Total_Numbers')
plt.hist(distribution.Total_Numbers)
plt.show()

from tqdm import tqdm

# Base conversion probability for each item (user-defined)
A = 0.2
B = 0.8
C = 0.1

seq = df.sequence

# Count the sequences that are exactly X->X->X, contain X->X, or contain X
count_AAA = int((seq == 'A->A->A').sum())
count_AA = int(seq.str.contains('A->A', regex=False).sum())
count_A = int(seq.str.contains('A', regex=False).sum())
count_BBB = int((seq == 'B->B->B').sum())
count_BB = int(seq.str.contains('B->B', regex=False).sum())
count_B = int(seq.str.contains('B', regex=False).sum())
count_CCC = int((seq == 'C->C->C').sum())
count_CC = int(seq.str.contains('C->C', regex=False).sum())
count_C = int(seq.str.contains('C', regex=False).sum())

# One Bernoulli draw per matching sequence; the base probability is scaled
# up for longer runs of the same item
bi_AAA = np.random.binomial(1, A*0.9, count_AAA)
bi_AA = np.random.binomial(1, A*0.5, count_AA)
bi_A = np.random.binomial(1, A*0.1, count_A)

bi_BBB = np.random.binomial(1, B*0.9, count_BBB)
bi_BB = np.random.binomial(1, B*0.5, count_BB)
bi_B = np.random.binomial(1, B*0.1, count_B)

bi_CCC = np.random.binomial(1, C*0.9, count_CCC)
bi_CC = np.random.binomial(1, C*0.5, count_CC)
bi_C = np.random.binomial(1, C*0.15, count_C)

bi_BBB.sum()/count_BBB

seq = df.sequence

# Rows matching each pattern, paired with the Bernoulli draws made for them
patterns = {
    'Outcome_AAA': (seq == 'A->A->A', bi_AAA),
    'Outcome_AA': (seq.str.contains('A->A', regex=False), bi_AA),
    'Outcome_A': (seq.str.contains('A', regex=False), bi_A),
    'Outcome_BBB': (seq == 'B->B->B', bi_BBB),
    'Outcome_BB': (seq.str.contains('B->B', regex=False), bi_BB),
    'Outcome_B': (seq.str.contains('B', regex=False), bi_B),
    'Outcome_CCC': (seq == 'C->C->C', bi_CCC),
    'Outcome_CC': (seq.str.contains('C->C', regex=False), bi_CC),
    'Outcome_C': (seq.str.contains('C', regex=False), bi_C),
}

# Non-matching rows get 0; matching rows get their draws in row order
for column, (mask, draws) in patterns.items():
    df[column] = 0
    df.loc[mask, column] = draws

# A sequence converts if any of its pattern-level draws is 1
df['Outcome'] = (df[list(patterns)].sum(axis=1) > 0).astype(int)
dataset = df[['sequence', 'Outcome']]
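
The code above hard-codes three items. As a rough sketch of how the same "convert if any draw succeeds" idea could be parameterised by item list, sequence length and row count, the following helper draws one Bernoulli trial per item occurrence. The function name `generate_sequences`, its parameters and the uniform choice of length and items are assumptions made for illustration; they are not part of the answer above, and the pattern multipliers (0.9, 0.5, 0.1) are not reproduced.

```python
import numpy as np
import pandas as pd


def generate_sequences(item_probs, max_len, n_rows, seed=0):
    """Sample random sequences and a binary outcome per row (illustrative helper)."""
    rng = np.random.default_rng(seed)
    items = list(item_probs)
    rows = []
    for _ in range(n_rows):
        length = int(rng.integers(1, max_len + 1))   # uniform length in 1..max_len
        seq = rng.choice(items, size=length)         # items may repeat
        # One Bernoulli draw per occurrence; the row converts if any draw is 1
        probs = np.array([item_probs[str(s)] for s in seq])
        draws = rng.random(length) < probs
        rows.append(('->'.join(seq), int(draws.any())))
    return pd.DataFrame(rows, columns=['sequence', 'Outcome'])


demo = generate_sequences({'A': 0.02, 'B': 0.04, 'C': 0.036},
                          max_len=3, n_rows=10_000)
print(demo.head())
print(demo['Outcome'].mean())
```

Note that with this scheme the conversion rate among rows containing A will generally be higher than A's own probability, because the other items in the row also contribute draws.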

score 0 · Answer 2 · answered Apr 20 '22 at 14:49

Although it may not be the most elegant method, you can achieve this using a for loop. For each row, split that row's Sequence into a list of events using .split(). You can find the count of each item using .count(), the length using len(), and the total or average outcome using np.sum() and np.mean(). Try using this code as a starting point:

df['Outcome'] = 0.0

for i, row in df.iterrows():
    list_of_events = row['Sequence'].split('->')
    # do your calculations on list_of_events here
    print(len(list_of_events))
    print(list_of_events.count("A"))
    my_calculation_for_outcome = list_of_events.count("B")*0.02
    # .loc is indexed with square brackets, not called like a function
    df.loc[i, 'Outcome'] = my_calculation_for_outcome
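
The value computed in the loop is a score rather than a 0/1 outcome. One possible way to turn it into a binary column, sketched here under the assumption that the score is meant to be read as a conversion probability, is a Bernoulli draw per row. The small `df` and the 0.02 weight per "B" are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'Sequence': ['A->B', 'C->C->B', 'A', 'B->B->B']})

# Per-row score as in the loop: 0.02 per occurrence of "B" (illustrative weight)
prob = df['Sequence'].str.split('->').apply(
    lambda events: events.count('B') * 0.02)

# Clip to a valid probability, then draw one Bernoulli outcome per row
df['Outcome'] = rng.binomial(1, prob.clip(0, 1).to_numpy())
print(df)
```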

To ensure the Outcome column has a given number of True values, you may want to look at this question: A fast way to find the largest N elements in a numpy array
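
A minimal sketch of that idea, assuming each row already has a numeric score and that exactly `n_positive` rows should end up with outcome 1: `np.argpartition` finds the indices of the largest scores without a full sort. The random `scores` array and the value of `n_positive` are placeholders, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(20)   # stand-in for the per-row values computed in the loop
n_positive = 5            # hypothetical target number of rows with Outcome == 1

# Indices of the n_positive largest scores (order within the group is arbitrary)
top_idx = np.argpartition(scores, -n_positive)[-n_positive:]

outcome = np.zeros(len(scores), dtype=int)
outcome[top_idx] = 1
print(outcome.sum())
```

This makes the outcome deterministic given the scores, so it fixes the count of positives at the cost of the randomness a Bernoulli draw would give.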
