
I'm trying to generate a random data series (or a time series) for anomaly detection, with events spanning a few consecutive data points. They could be values above/below a certain threshold, or anomaly types with different known probabilities.

e.g. in a case where 1 is normal and event types are within [2, 3, 4]: 11112221113333111111112211111

I looked through the np.random and random methods, but couldn't find any that generate these events. My current solution is picking random points, adding random durations to them to generate event start and end positions, labeling each event with a random event type, and joining back to the dataset, something like:

import numpy as np
num_events = np.random.randint(1, 10)
number_series = [1]*60
first_pos = 0 
event_starts = sorted([first_pos + i for i in np.random.randint(50, size = num_events)])
event_ends = [sum(i) for i in list(zip(event_starts, np.random.randint(8, size = num_events)))]
for c in list(zip(event_starts, event_ends)):
    rand_event_type  = np.random.choice(a = [2, 3, 4], p = [0.5, 0.3, 0.2])
    number_series[c[0]:c[1]] = [rand_event_type]*len(number_series[c[0]:c[1]])
print(number_series)

[1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 3, 3, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

But I'm wondering if there is a simpler way to just generate a series of numbers with events, based on a set of probabilities.

lesk_s

3 Answers


It all depends on how you model your process (the underlying process you want to simulate). You can read more about some of the usual models on Wikipedia.

Simplest

In the following, we use a very simple model (slightly different from yours): each event has a probability (as in your question) and a random duration that is independent of the event itself. 1 ("normal") is an event like any other (unlike in your sample code). We could change that, but this is one of the simplest models you can think of.

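import numpy as np

def gen_events(n):
    # n event labels (1 = normal), each with an independent duration of 1 to 7 points
    events = np.random.choice(a=[1, 2, 3, 4], p=[0.6, 0.2, 0.12, 0.08], size=n)
    durations = np.random.randint(1, 8, size=n)
    return np.repeat(events, durations)  # expand each label into a run of its duration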

np.random.seed(0)  # repeatable example
number_series = gen_events(10)  # for example

>>> number_series
array([1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 2, 1, 1, 1, 1, 1, 1, 3, 4, 4, 1, 1, 1, 1, 1])

Note, this is very fast:

%timeit gen_events(1_000_000)
# 44.9 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
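
If the total length has to be fixed, as in the question's own code, one option is the multinomial trick suggested in the comments below: split the `size - n` spare points at random among the `n` events and add 1 to each duration. The following is only a sketch; the name `gen_events_fixed_length`, the `size` parameter and the optional `rng` argument are assumptions introduced for illustration.

```python
import numpy as np

def gen_events_fixed_length(n, size, rng=None):
    """Sketch: n events whose durations sum to exactly `size` points."""
    rng = np.random.default_rng() if rng is None else rng
    if size < n:
        raise ValueError("size must be >= n so every event lasts at least 1 point")
    events = rng.choice([1, 2, 3, 4], p=[0.6, 0.2, 0.12, 0.08], size=n)
    # distribute the size - n spare points uniformly at random over the n events,
    # then add 1 so that no event has zero length
    durations = rng.multinomial(size - n, np.full(n, 1.0 / n)) + 1
    return np.repeat(events, durations)

series = gen_events_fixed_length(10, 60, rng=np.random.default_rng(0))
print(len(series))
```

Note that durations are no longer capped at 7 here, and two adjacent events can still share the same label and merge into one longer run.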

Markov chain

Another model (easier to parameterize, a bit more complex to implement) would be a Markov model. The simplest of them would be a Markov chain. Here is a super simple (but not very efficient) version:

def markov_chain(P, n, initial_state=0):
    m = P.shape[0]
    ix = np.arange(m)
    s = np.empty(n, dtype=int)
    s[0] = initial_state
    for i in range(1, n):
        s[i] = np.random.choice(ix, p=P[s[i-1]])
    return s

Above, P is a transition matrix, where each cell P[i,j] is the probability of transitioning from state i to state j (so every row must sum to 1). Here is an example application:

P = np.array([
    [.7, .1, .12, .08],  # from 0 to others
    [.3, .6, .05, .05],
    [.3, 0, .65, .05],
    [.4, 0, .05, .55],
])

np.random.seed(0)
n = 100
s = markov_chain(P, n) + 1
>>> s
array([1, 1, 2, 2, 2, 2, 2, 2, 2, 4, 1, 2, 2, 2, 3, 1, 1, 1, 3, 3, 3, 4,
       4, 4, 4, 1, 1, 1, 4, 4, 3, 1, 2, 2, 2, 1, 1, 1, 1, 4, 4, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 3, 1, 3, 1, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 4, 1, 1, 1, 2, 1, 1, 1, 1, 3])
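
Most of the time in `markov_chain` is spent in the per-step `np.random.choice` call. A possible speed-up, sketched here under the assumption that every row of P sums to 1, is to precompute the cumulative rows once, draw all uniform numbers up front, and look up each next state with `np.searchsorted`. The name `markov_chain_fast` and the `rng` argument are hypothetical.

```python
import numpy as np

def markov_chain_fast(P, n, initial_state=0, rng=None):
    """Sketch: same model as markov_chain, using inverse-CDF lookups."""
    rng = np.random.default_rng() if rng is None else rng
    cum = np.cumsum(P, axis=1)
    cum[:, -1] = 1.0              # guard against floating-point rounding
    u = rng.random(n)             # one uniform draw in [0, 1) per step
    s = np.empty(n, dtype=int)
    s[0] = initial_state
    for i in range(1, n):
        # first state j whose cumulative probability exceeds u[i]
        s[i] = np.searchsorted(cum[s[i - 1]], u[i], side="right")
    return s

P = np.array([
    [.7, .1, .12, .08],
    [.3, .6, .05, .05],
    [.3, 0, .65, .05],
    [.4, 0, .05, .55],
])
s = markov_chain_fast(P, 100, rng=np.random.default_rng(0)) + 1
```

The loop itself remains, because each state depends on the previous one; only the sampling inside it gets cheaper.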

Note that the long-run (stationary) probability of each state is usually denoted pi and is given by any row of lim_{k -> \infty} P^k, where P^k is the matrix power (np.linalg.matrix_power(P, k)), not the elementwise P**k:

>>> import pandas as pd
>>> pd.Series(markov_chain(P, 1000, 0)).value_counts(normalize=True).sort_index()
0    0.530
1    0.135
2    0.209
3    0.126

>>> np.linalg.matrix_power(P, 40)[0]
array([0.52188552, 0.13047138, 0.21632997, 0.13131313])
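
Instead of raising P to a large power, the stationary distribution can also be obtained by solving pi = pi P directly, i.e. taking the left eigenvector of P for eigenvalue 1. A minimal sketch, assuming the chain has a unique stationary distribution (which holds for the example matrix); the variable names are illustrative only.

```python
import numpy as np

P = np.array([
    [.7, .1, .12, .08],
    [.3, .6, .05, .05],
    [.3, 0, .65, .05],
    [.4, 0, .05, .55],
])

# left eigenvectors of P are the right eigenvectors of P.T
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1))   # index of the eigenvalue closest to 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                   # normalise so the probabilities sum to 1
print(pi)
```

The result should agree with the row of `np.linalg.matrix_power(P, 40)` shown above, up to floating-point error.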
Pierre D
  • The sample distribution of this solution is different. OP's guarantees the length and randomizes the number of events. Here it is the other way around. Nevertheless an elegant approach. – Michael Szczesny Jun 04 '22 at 21:14
  • Yes indeed. Given the relative lack of details about the desired properties of the distribution, I didn't try to match the OP's process exactly until we know more about the intent. – Pierre D Jun 04 '22 at 21:16
  • In your first approach you can guarantee the length with `durations = np.random.multinomial(size-n, np.full(n, 1/n))+1` (introducing a second parameter size, with size >= n, to the function). – Michael Szczesny Jun 04 '22 at 21:46

A less verbose way would be to generate your list of events on the go.

Set, for example, a probability for an occurrence of an anomaly (say, 5%). Then,

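from numpy.random import random, choice, randint

events = []
for i in range(60):
  if random() <= 0.95:
    events.append(1)  # normal point
  else:
    # anomaly lasting 1 to 7 points (randint(8) could return 0 and drop the event)
    events.extend([choice(a = [2, 3, 4], p = [0.5, 0.3, 0.2])] * randint(1, 8))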
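
Because every anomaly appends several points in one iteration, the loop above can return more than 60 values. If an exact length matters, one possible variant is to loop until the list is long enough and then truncate. This is a sketch under the assumption that cutting the last event short is acceptable; the function name `gen_series` and its parameters are hypothetical.

```python
import numpy as np

def gen_series(length=60, p_anomaly=0.05, max_duration=7, rng=None):
    """Sketch: build the series on the go, then cut it to `length` points."""
    rng = np.random.default_rng() if rng is None else rng
    series = []
    while len(series) < length:
        if rng.random() >= p_anomaly:
            series.append(1)                                 # normal point
        else:
            kind = int(rng.choice([2, 3, 4], p=[0.5, 0.3, 0.2]))
            duration = int(rng.integers(1, max_duration + 1))  # 1..max_duration
            series.extend([kind] * duration)
    return series[:length]                                   # may shorten the final event

print(gen_series(rng=np.random.default_rng(0)))
```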
rafaelc

You can draw random numbers from a uniform distribution over [0, 1) and use numpy.select to map each of them to 1, 2, 3 or 4 according to the cumulative probabilities, as below:

import numpy as np

def generate_random_data_series(num, prob=(0.6, 0.2, 0.05, 0.15)):
    x = np.random.rand(num)             # uniform samples in [0, 1)
    prob = np.cumsum(np.asarray(prob))  # cumulative thresholds
    # np.select uses the first condition that is true for each element
    condlist = [
        x < prob[0],
        x < prob[1],
        x < prob[2],
        x < prob[3]
    ]
    choicelist = [1, 2, 3, 4]
    return np.select(condlist, choicelist, default=1)
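
The four hard-coded conditions can be generalised to any number of categories with `np.searchsorted` on the cumulative probabilities. The following is a sketch of that idea rather than part of the answer; `sample_categories`, `values` and `rng` are names assumed for illustration.

```python
import numpy as np

def sample_categories(num, values=(1, 2, 3, 4), prob=(0.6, 0.2, 0.05, 0.15), rng=None):
    """Sketch: inverse-CDF sampling of independent categorical values."""
    rng = np.random.default_rng() if rng is None else rng
    cum = np.cumsum(prob)
    cum[-1] = 1.0                      # guard against floating-point rounding
    # index of the first cumulative threshold that exceeds each uniform draw
    idx = np.searchsorted(cum, rng.random(num), side="right")
    return np.asarray(values)[idx]

res = sample_categories(100, rng=np.random.default_rng(0))
```

As with the function above, every point is drawn independently, so runs of consecutive anomalies appear only by chance rather than with a controlled duration.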

Benchmark on Colab:

%timeit generate_random_data_series(1_000_000)
# 25.1 ms per loop (10 loops, best of 5)

Test function:

>>> from collections import Counter
>>> res = generate_random_data_series(100)
>>> res
array([1, 1, 4, 1, 4, 1, 1, 1, 4, 1, 3, 4, 4, 1, 1, 1, 1, 4, 1, 1, 2, 1,
       4, 1, 1, 1, 1, 1, 2, 1, 1, 4, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2,
       1, 2, 2, 1, 1, 4, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 2, 1, 2, 1, 4, 1,
       1, 1, 2, 1, 1, 1, 1, 4, 1, 4, 2, 4, 4, 4, 2, 3, 2, 2, 2, 2, 1, 1,
       2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1])
>>> Counter(res)
Counter({1: 61, 4: 15, 3: 3, 2: 21})
# prob  1 : 60%
# count 1 : 61 in 100 random numbers
# prob  2 : 20%
# count 2 : 21 in 100 random numbers
# prob  3 : 5%
# count 3 : 3 in 100 random numbers
# prob  4 : 15%
# count 4 : 15 in 100 random numbers
I'mahdi