This function allows to estimate the entropy of a time series. It is based on the Lempel-Ziv compression algorithm. For a time series of length n, the entropy is estimate as:
E= (1/n SUM_i L_i )^-1 ln(n)
where L_i is the longness of the shortest substring starting at position i which doesn't previously appear from position 1 to i-1. The estimated entropy converges to the real entropy of the time series when n approaches to infinity.
There is already an implementation in MATLAB functions: https://cn.mathworks.com/matlabcentral/fileexchange/51042-entropy-estimator-based-on-the-lempel-ziv-algorithm?s_tid=prof_contriblnk
I would like to implement is in Python and I did it like this:
def contains(small, big):
for i in range(len(big)-len(small)+1):
if big[i:i+len(small)] == small:
return True
return False
def actual_entropy(l):
n = len(l)
sequence = [l[0]]
sum_gamma = 0
for i in range(1, n):
for j in range(i+1, n+1):
s = l[i:j]
if contains(s, sequence) != True:
sum_gamma += len(s)
sequence.append(l[i])
break
ae = 1 / (sum_gamma / n ) * math.log(n)
return ae
However, I found it calculate too slow when the data size is getting bigger. For example, I used a list of 23832 elements as an input and time consumed is like this: (data can be found here)
0-1000: 1.7068431377410889 s
1000-2000: 18.561192989349365 s
2000-3000: 84.82257103919983 s
3000-4000: 243.5819959640503 s
...
I have thousands of lists like this to be calculated and such long time is unbearable. How should I optimize this function and make it work faster?