-1

I want to generate synthetic data from scratch: sequence data with a binary outcome (0/1). The data has the following properties.

For the sake of an example, let's say there are only 3 items in the sequence, namely A, B and C. The data looks like this:

  • It is sequence-based data, so items A, B and C occur in an order.
  • Items A, B and C have features S, T, U, V, X, Y, Z, etc. These features need to have some effect on generating outcome 1; think of them as feature importance.
  • The probability of conversion when A, B or C is encountered in the data is user-defined. For example, I want to control that if A occurs anywhere in the sequence, the overall probability of conversion to outcome 1 is, say, 2% (more below).
  • Items can repeat within a sequence, so a sequence can be C->C->A, etc.

Given the probability of conversion for each item when it occurs in the data (for example, whenever A is encountered in the sequence the probability of outcome 1 is about 2%, when B occurs it is 2.6%, and so on), I want to generate data randomly. The generated data should look something like this:

ID  Sequence  Feature  Outcome
1   A->B      X        0
2   C->C->B   Y        1
3   A->B      X        1
4   A         Z        0
5   A->B->A   Z        0
6   C->C      Y        1

and so on

When generating this data, I want to have control over -

  • The conversion probability of A, B and C, i.e. defining that when A occurs the probability of conversion is, say, 2%, for B it is 4% and for C it is 3.6%.
  • The number of converted sequences for each sequence length (for example, with a maximum sequence length of 3, I want at least 100000 data points of length 3 having outcome 1).
  • How many items I can include (so A, B, C and D, with a sequence length of up to 4 instead of 3).
  • The total number of data points, if possible.

Is there a simple way to generate this data while keeping all these parameters in mind?

Kshitij Yadav
  • I'm fairly certain you know exactly what you're asking, but it's not very clear to me as a developer what it is you're asking. To me it looks like a markov chain statistics text got put in a blender with alphabet soup. Please give a [mcve], preferably a small example that starts with actual input data, demonstrates the process, and lay out an expected output. – Daniel F Apr 26 '22 at 09:29

2 Answers

1
import pandas as pd
import itertools
import numpy as np
import random


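# Items that can appear in a sequence
alphabets=['A','B','C']

# Build every sequence of length 1..len(alphabets), repeats allowed (e.g. 'C->C->A')
combinations=[]
for length in range(1,len(alphabets)+1):
    combinations.append(['->'.join(combo) for combo in itertools.product(alphabets, repeat=length)])
combinations=sum(combinations, [])

# Random sampling weight per sequence, normalised to sum to 1.
# abs() guards against a rare negative draw; random.choices assumes non-negative weights.
weights=np.abs(np.random.normal(100,30,len(combinations)))
weights/=sum(weights)
weights=weights.tolist()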
# Alternative ways to draw the weights:
#weights=np.random.dirichlet(np.ones(len(combinations))*1000.,size=1)
#n = len(combinations)
#weights = [random.random() for _ in range(n)]
#sum_weights = sum(weights)
#weights = [w/sum_weights for w in weights]


df=pd.DataFrame(random.choices(
    population=combinations,weights=weights,
    k=1000000),columns=['sequence'])

# -

import matplotlib.pyplot as plt

# Distribution of the sampling weights
plt.hist(weights, bins = 20)
plt.show()

# Number of generated rows per distinct sequence
distribution=df.groupby('sequence').agg({'sequence':'count'}).rename(columns={'sequence':'Total_Numbers'}).reset_index()
plt.hist(distribution.Total_Numbers)
plt.show()

# + tags=[]
from tqdm import tqdm

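# Base conversion probability for each item
A=0.2
B=0.8
C=0.1
# Number of rows matching each pattern (exact triple, repeated pair, any occurrence)
count_AAA=count_AA=count_A=0
count_BBB=count_BB=count_B=0
count_CCC=count_CC=count_C=0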

for i in tqdm(range(0,len(df))):
    if(df.sequence[i]=='A->A->A'):
        count_AAA+=1
    if('A->A' in df.sequence[i]):
        count_AA+=1
    if('A' in df.sequence[i]):
        count_A+=1
    if(df.sequence[i]=='B->B->B'):
        count_BBB+=1
    if('B->B' in df.sequence[i]):
        count_BB+=1
    if('B' in df.sequence[i]):
        count_B+=1
    if(df.sequence[i]=='C->C->C'):
        count_CCC+=1
    if('C->C' in df.sequence[i]):
        count_CC+=1
    if('C' in df.sequence[i]):
        count_C+=1
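# One Bernoulli draw per matching row; the multiplier scales the item's base
# probability according to the pattern that matched
bi_AAA = np.random.binomial(1, A*0.9, count_AAA)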
bi_AA = np.random.binomial(1, A*0.5, count_AA)
bi_A = np.random.binomial(1, A*0.1, count_A)

bi_BBB = np.random.binomial(1, B*0.9, count_BBB)
bi_BB = np.random.binomial(1, B*0.5, count_BB)
bi_B = np.random.binomial(1, B*0.1, count_B)

bi_CCC = np.random.binomial(1, C*0.9, count_CCC)
bi_CC = np.random.binomial(1, C*0.5, count_CC)
bi_C = np.random.binomial(1, C*0.15, count_C)
# -

bi_BBB.sum()/count_BBB

# + tags=[]
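# Running indices into the bi_* arrays.
# Note: this rebinds A, B and C, which held the base probabilities above.
AAA=AA=A=BBB=BB=B=CCC=CC=C=0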

for i in tqdm(range(0,len(df))):
    if(df.sequence[i]=='A->A->A'):
        df.at[i, 'Outcome_AAA'] = bi_AAA[AAA]
        AAA+=1
    if('A->A' in df.sequence[i]):
        df.at[i, 'Outcome_AA'] = bi_AA[AA]
        AA+=1
    if('A' in df.sequence[i]):
        df.at[i, 'Outcome_A'] = bi_A[A]
        A+=1
    if(df.sequence[i]=='B->B->B'):
        df.at[i, 'Outcome_BBB'] = bi_BBB[BBB]
        BBB+=1
    if('B->B' in df.sequence[i]):
        df.at[i, 'Outcome_BB'] = bi_BB[BB]
        BB+=1
    if('B' in df.sequence[i]):
        df.at[i, 'Outcome_B'] = bi_B[B]
        B+=1
    if(df.sequence[i]=='C->C->C'):
        df.at[i, 'Outcome_CCC'] = bi_CCC[CCC]
        CCC+=1
    if('C->C' in df.sequence[i]):
        df.at[i, 'Outcome_CC'] = bi_CC[CC]
        CC+=1
    if('C' in df.sequence[i]):
        df.at[i, 'Outcome_C'] = bi_C[C]
        C+=1
        
df=df.fillna(0)       


# A row converts if any of its pattern-level draws came up 1
outcome_cols=[c for c in df.columns if c.startswith('Outcome_')]
df['Outcome']=(df[outcome_cols].sum(axis=1)>0).astype(int)
dataset=df[['sequence','Outcome']]
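
The same idea can probably be written without the row-by-row loops. The following is only a sketch under a few assumptions that are not in the code above: the function name `simulate` and its parameters are made up for illustration, sequences are sampled uniformly instead of with normally distributed weights, and the multipliers 0.1 / 0.5 / 0.9 are applied to every item (the code above uses 0.15 for a single C).

```python
import itertools
import numpy as np
import pandas as pd

def simulate(n_rows=10_000, items=("A", "B", "C"), base=None, seed=0):
    """Hypothetical helper: sample sequences and draw a 0/1 outcome per row."""
    rng = np.random.default_rng(seed)
    # Base conversion probability per item (assumed values, as in the answer)
    base = base or {"A": 0.2, "B": 0.8, "C": 0.1}

    # Every sequence of length 1..len(items), repeats allowed
    sequences = ["->".join(c)
                 for k in range(1, len(items) + 1)
                 for c in itertools.product(items, repeat=k)]
    # Assumption: uniform sampling over all possible sequences
    df = pd.DataFrame({"sequence": rng.choice(sequences, size=n_rows)})

    hit = np.zeros(n_rows, dtype=bool)
    for item in items:
        seq = df["sequence"]
        patterns = [
            (seq.str.contains(item, regex=False), 0.1),              # item occurs
            (seq.str.contains(f"{item}->{item}", regex=False), 0.5),  # repeated pair
            (seq == "->".join([item] * 3), 0.9),                      # exact triple
        ]
        for mask, multiplier in patterns:
            # One Bernoulli draw per row, kept only where the pattern matches
            draws = rng.random(n_rows) < base[item] * multiplier
            hit |= mask.to_numpy() & draws

    # A row converts if any matching pattern produced a 1
    df["Outcome"] = hit.astype(int)
    return df
```

Calling `simulate(n_rows=1_000_000)` would then give a frame with `sequence` and `Outcome` columns, and `df.groupby("sequence")["Outcome"].mean()` can be used to check the empirical conversion rate per sequence.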
0

Although it may not be the most elegant method, you can achieve this using a for loop. For each row, split that row's Sequence value into a list of events using .split(). You can count how often each item occurs using .count(), find the sequence length using len(), and compute the total or average outcome using np.sum() and np.mean(). Try using this code as a starting point:

df['Outcome'] = 0.0

for i, row in df.iterrows():
    list_of_events = row['Sequence'].split('->')
    # do your calculations on list_of_events here
    print(len(list_of_events))          # sequence length
    print(list_of_events.count("A"))    # number of times A occurs
    my_calculation_for_outcome = list_of_events.count("B")*0.02
    # write the result back into this row
    df.loc[i, 'Outcome'] = my_calculation_for_outcome
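
As one possible way to fill in the calculation step, here is a minimal sketch that turns per-item probabilities into a per-row conversion probability and then draws the 0/1 outcome. The per-item values are the example numbers from the question; the rule for combining them (each event converts independently, so the row converts with probability 1 minus the product of the misses) is an assumption, and `ITEM_PROB` and `conversion_probability` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Example per-item conversion probabilities taken from the question
ITEM_PROB = {"A": 0.02, "B": 0.04, "C": 0.036}

def conversion_probability(sequence, item_prob=ITEM_PROB):
    """Assumed rule: every event in the sequence converts independently."""
    p_no_conversion = 1.0
    for event in sequence.split("->"):
        p_no_conversion *= 1.0 - item_prob[event]
    return 1.0 - p_no_conversion

df = pd.DataFrame({"Sequence": ["A->B", "C->C->B", "A", "A->B->A"]})

rng = np.random.default_rng(42)
df["P_convert"] = df["Sequence"].map(conversion_probability)
# Bernoulli draw per row: 1 with probability P_convert, else 0
df["Outcome"] = (rng.random(len(df)) < df["P_convert"]).astype(int)
```

With this rule a sequence containing only A converts with probability 0.02, and repeated items raise the probability slightly with each occurrence.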

To make sure the Outcome column ends up with a given number of True values, you may want to look at: A fast way to find the largest N elements in an numpy array
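
A rough sketch of that idea, assuming each row already has some numeric score (for example a conversion probability): `np.argpartition` finds the N largest scores without a full sort, and those rows get outcome 1. The function name `top_n_outcomes` is hypothetical. Note that this selection is deterministic, so some random noise would have to be mixed into the scores if the result should still look random.

```python
import numpy as np

def top_n_outcomes(scores, n):
    """Return a 0/1 array with exactly n ones, placed at the n largest scores."""
    scores = np.asarray(scores, dtype=float)
    if not 0 <= n <= scores.size:
        raise ValueError("n must be between 0 and the number of rows")
    outcome = np.zeros(scores.size, dtype=int)
    if n > 0:
        # Indices of the n largest scores (in no particular order)
        top = np.argpartition(scores, -n)[-n:]
        outcome[top] = 1
    return outcome

# Example: mark the 2 highest-scoring rows out of 4 as converted
example = top_n_outcomes([0.1, 0.9, 0.3, 0.7], 2)
```

To control the count per sequence length, the same function could be applied separately to the rows of each length.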

Raisin