I want to generate synthetic data from scratch: sequence data with a binary outcome (0/1).
For the sake of an example, let's say there are only 3 items that can appear in a sequence, namely A, B and C. The data has the following properties:
- It is sequence-based data, so items A, B and C occur in an order.
- Items A, B and C have features S, T, U, V, X, Y, Z, etc. (these features need to have some effect on generating outcome 1; think of them as feature importance).
- The probability of conversion when A, B or C is encountered in the data is user-defined (I want control so that, for example, if A occurs anywhere in the sequence, the overall probability of conversion to outcome 1 is 2%; more on this below).
- Items can repeat in a sequence, so a sequence can look like C->C->A.
Given the probability of conversion for each item when it occurs in the data (for example, whenever A is encountered in the sequence the probability of outcome 1 is about 2%, when B occurs it is 2.6%, and so on), I want to generate data randomly. The generated data should look something like this:
ID  Sequence  Feature  Outcome
1   A->B      X        0
2   C->C->B   Y        1
3   A->B      X        1
4   A         Z        0
5   A->B->A   Z        0
6   C->C      Y        1
and so on
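To make the target concrete, here is a minimal sketch of how one such row could be drawn. Everything in it is an assumption on my part rather than a known method: the per-item probabilities are combined with a "noisy-OR" rule (each occurrence of an item is an independent chance to convert), each sequence carries a single feature, and that feature acts as a multiplier on the item probabilities. The names `ITEM_PROB`, `FEATURE_MULT`, `conversion_prob` and `random_row` and the multiplier values are hypothetical.

```python
import random

# Hypothetical inputs: per-item conversion probabilities (example values)
# and assumed feature effects expressed as multipliers.
ITEM_PROB = {"A": 0.02, "B": 0.04, "C": 0.036}
FEATURE_MULT = {"X": 1.0, "Y": 1.5, "Z": 0.5}

def conversion_prob(sequence, feature):
    """Noisy-OR (an assumption): every item occurrence is an independent
    chance to convert, scaled by the feature multiplier."""
    p_no_conversion = 1.0
    for item in sequence:
        p_item = min(1.0, ITEM_PROB[item] * FEATURE_MULT[feature])
        p_no_conversion *= 1.0 - p_item
    return 1.0 - p_no_conversion

def random_row(rng, max_len=3):
    """Draw one (sequence, feature, outcome) row; items may repeat."""
    length = rng.randint(1, max_len)  # inclusive on both ends
    sequence = [rng.choice(list(ITEM_PROB)) for _ in range(length)]
    feature = rng.choice(list(FEATURE_MULT))
    outcome = int(rng.random() < conversion_prob(sequence, feature))
    return "->".join(sequence), feature, outcome

rng = random.Random(0)  # fixed seed so the run is repeatable
for row_id in range(1, 7):
    print(row_id, *random_row(rng))
```

Under this rule a sequence containing A converts with probability at least `ITEM_PROB["A"]` times the feature multiplier, so the observed rate for "sequences containing A" will generally not equal 2% exactly. Hitting that rate exactly would need a different combination rule or a calibration step.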
When generating this data, I want to have control over:
- The conversion probability of A, B and C, essentially defining that when A occurs the probability of conversion is, let's say, 2%, for B it is 4% and for C it is 3.6%.
- The number of converted sequences for each sequence length (for example, if the maximum sequence length is 3, I want at least 100000 data points of length 3 with outcome 1).
- How many items I can include (so A, B, C and D, with a maximum sequence length of 4 instead of 3).
- The total number of data points, if possible.
Is there a simple way to generate this data while keeping all these parameters in mind?
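For reference, the controls listed above could be exposed roughly as in the sketch below. It is only an illustration under my own assumptions: the function name `generate`, its parameters, the noisy-OR combination of item probabilities, the feature multipliers, and the probability used for item D are all hypothetical. It keeps sampling until both the requested total and the per-length minimum of converted rows are reached, so `n_total` acts as a lower bound.

```python
import random
from collections import Counter

def generate(item_prob, features, max_len, n_total, min_converted, seed=0):
    """Sample rows until there are at least n_total rows and every sequence
    length k has at least min_converted[k] rows with outcome 1.

    item_prob:     {item: conversion probability}
    features:      {feature: multiplier on item probabilities} (assumed form)
    min_converted: {sequence length: minimum number of outcome-1 rows}
    """
    if any(k < 1 or k > max_len for k in min_converted):
        raise ValueError("min_converted has a length outside 1..max_len")
    rng = random.Random(seed)
    items, feats = list(item_prob), list(features)
    rows, converted = [], Counter()

    def quotas_met():
        return all(converted[k] >= v for k, v in min_converted.items())

    while len(rows) < n_total or not quotas_met():
        length = rng.randint(1, max_len)
        seq = [rng.choice(items) for _ in range(length)]  # repeats allowed
        feat = rng.choice(feats)
        # Noisy-OR combination of the item probabilities (an assumption).
        p_none = 1.0
        for item in seq:
            p_none *= 1.0 - min(1.0, item_prob[item] * features[feat])
        outcome = int(rng.random() < 1.0 - p_none)
        rows.append((len(rows) + 1, "->".join(seq), feat, outcome))
        converted[length] += outcome
    return rows

# Example call with small quotas; the value for D is made up for illustration.
rows = generate(
    item_prob={"A": 0.02, "B": 0.04, "C": 0.036, "D": 0.03},
    features={"X": 1.0, "Y": 1.5, "Z": 0.5},
    max_len=4,
    n_total=1000,
    min_converted={1: 5, 2: 5, 3: 5, 4: 5},
)
for row in rows[:5]:
    print(*row)
```

Because every sampled row is kept, the conversion rates stay whatever the probability model implies, but a quota such as 100000 converted rows at a rate of a few percent means generating millions of rows. Keeping only converted rows once the total is reached would be faster but would inflate the conversion rate.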