
This function estimates the entropy of a time series. It is based on the Lempel-Ziv compression algorithm. For a time series of length n, the entropy is estimated as:

E = ( (1/n) * SUM_i L_i )^(-1) * ln(n)

where L_i is the length of the shortest substring starting at position i that does not appear anywhere in positions 1 to i-1. The estimated entropy converges to the true entropy of the time series as n approaches infinity.
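To make the definition concrete, here is a small self-contained sketch (the helper name and the toy sequence are mine, purely illustrative) that computes the L_i values and the resulting estimate directly from the definition:

import math

def shortest_new_substring_length(seq, i):
    # Length of the shortest substring starting at index i that does not
    # occur anywhere in seq[:i]; if every such substring does occur,
    # fall back to the remaining length.
    prefix = seq[:i]
    for length in range(1, len(seq) - i + 1):
        s = seq[i:i + length]
        if not any(prefix[k:k + length] == s for k in range(i - length + 1)):
            return length
    return len(seq) - i

seq = [1, 2, 1, 1, 2, 2, 1, 2]
n = len(seq)
L = [shortest_new_substring_length(seq, i) for i in range(1, n)]
E = 1 / (sum(L) / n) * math.log(n)  # same formula as above
print(L, E)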

There is already a MATLAB implementation: https://cn.mathworks.com/matlabcentral/fileexchange/51042-entropy-estimator-based-on-the-lempel-ziv-algorithm?s_tid=prof_contriblnk

I would like to implement it in Python, and I did it like this:

import math

def contains(small, big):
    # Return True if the list `small` appears as a contiguous sublist of `big`.
    for i in range(len(big) - len(small) + 1):
        if big[i:i + len(small)] == small:
            return True
    return False

def actual_entropy(l):
    n = len(l)
    sequence = [l[0]]
    sum_gamma = 0

    for i in range(1, n):
        # Find the shortest substring starting at i that is not contained
        # in the part of the series seen so far.
        for j in range(i + 1, n + 1):
            s = l[i:j]
            if not contains(s, sequence):
                sum_gamma += len(s)
                sequence.append(l[i])
                break

    ae = 1 / (sum_gamma / n) * math.log(n)
    return ae

However, I found that it calculates far too slowly as the data size grows. For example, I used a list of 23832 elements as input, and the time consumed is as follows (data can be found here):

0-1000: 1.7068431377410889 s
1000-2000: 18.561192989349365 s
2000-3000: 84.82257103919983 s
3000-4000: 243.5819959640503 s
...

I have thousands of lists like this to process, and such long running times are unacceptable. How should I optimize this function to make it run faster?

  • Instead of working with lists, you can try working with numpy; this may speed up the process. – Kev1n91 Sep 19 '17 at 09:43
  • So you suggest using a numpy array instead of a list? – Frank Wang Sep 19 '17 at 09:57
  • Besides being careful with data-structures (many appends on array; ouch; numpy won't help here), use a profiler to check what's slow. The two candidates here are ```contains``` and ```sequence.append```. The former can probably be optimized by reusing the old position as the start, but you have to check your algorithm. – sascha Sep 19 '17 at 11:10
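As an aside, sascha's profiling suggestion can be tried with the standard library's cProfile; a minimal sketch, assuming the input list is called data (a placeholder name, not from the question):

import cProfile

# `data` stands in for whatever list is passed to actual_entropy
cProfile.run('actual_entropy(data[:1000])', sort='cumtime')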

1 Answer


I played around a bit and tried a few different approaches from another thread on StackOverflow. This is the code I came up with:

import math

def contains(small, big):
    # `small` and `big` are numpy arrays; searching their raw byte
    # representation with bytes.index() is much faster than the
    # pure-Python sliding-window comparison.
    # (.tostring() is an alias of .tobytes() and is deprecated in
    # newer numpy versions.)
    try:
        big.tostring().index(small.tostring()) // big.itemsize
        return True
    except ValueError:
        return False

def actual_entropy(l):
    # l is expected to be a numpy array, not a plain list.
    n = len(l)
    sum_gamma = 0

    for i in range(1, n):
        sequence = l[:i]

        for j in range(i + 1, n + 1):
            s = l[i:j]
            if not contains(s, sequence):
                sum_gamma += len(s)
                break

    ae = 1 / (sum_gamma / n) * math.log(n)
    return ae
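For completeness, this version expects l to be a numpy array rather than a plain Python list, so a call would look something like this (the values below are just an example, not your data):

import numpy as np

data = np.array([1, 2, 1, 1, 2, 2, 1, 2, 1, 1])
print(actual_entropy(data))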

Funnily enough, casting the numpy arrays to strings is faster than working with strings directly. A very crude benchmark of my code on my machine, with the data you provide, gives:

   N:  my code - your code
1000:   0.039s -    1.039s
2000:   0.266s -   18.490s
3000:   0.979s -   74.761s
4000:   2.891s -  285.488s

You may be able to make this even faster if you parallelize the outer loop.
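A minimal sketch of that idea (my own, untested against your data; the process count is an arbitrary assumption, the helpers must live at module level so multiprocessing can pickle them, and on platforms that spawn processes the call should go under if __name__ == '__main__':):

from multiprocessing import Pool
import math

def gamma_for_index(args):
    # Work for a single outer-loop index i: length of the shortest
    # substring starting at i that is not contained in l[:i].
    l, i = args
    sequence = l[:i]
    for j in range(i + 1, len(l) + 1):
        s = l[i:j]
        if not contains(s, sequence):
            return len(s)
    return 0

def actual_entropy_parallel(l, processes=4):
    n = len(l)
    with Pool(processes) as pool:
        sum_gamma = sum(pool.map(gamma_for_index, [(l, i) for i in range(1, n)]))
    return 1 / (sum_gamma / n) * math.log(n)

Shipping l to the workers for every index adds pickling overhead, so chunking the indices may pay off. Since you mention having thousands of lists, it may also be simpler to keep actual_entropy single-process and instead map it over the lists with a Pool.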

cstich