7

I'm comparing the libraries dtaidistance, fastdtw and cdtw for DTW computations. This is my code:

from fastdtw import fastdtw
from cdtw import pydtw
import fastdtw
import array
from timeit import default_timer as timer
from dtaidistance import dtw, dtw_visualisation as dtwvis

s1 = mySampleSequences[0] # first sample sequence consisting of 3000 samples
s2 = mySampleSequences[1] # second sample sequence consisting of 3000 samples

start = timer()
distance1 = dtw.distance(s1, s2)
end = timer()
start2 = timer()
distance2 = dtw.distance_fast(array.array('d',s1),array.array('d',s2))
end2 = timer()
start3 = timer()
distance3, path3 = fastdtw(s1,s2)
end3 = timer()
start4 = timer()
distance4 = pydtw.dtw(s1,s2).get_dist()
end4 = timer()

print("dtw.distance(x,y) time: "+ str(end - start))
print("dtw.distance(x,y) distance: "+str(distance1))
print("dtw.distance_fast(x,y) time: "+ str(end2 - start2))
print("dtw.distance_fast(x,y) distance: " + str(distance2))
print("fastdtw(x,y) time: "+ str(end3 - start3))
print("fastdtw(x,y) distance: " + str(distance3))
print("pydtw.dtw(x,y) time: "+ str(end4 - start4))
print("pydtw.dtw(x,y) distance: " + str(distance4))

This is the output I get:

  • dtw.distance(x,y) time: 22.16925272245262
  • dtw.distance(x,y) distance: 1888.8583853746156
  • dtw.distance_fast(x,y) time: 0.3889036471839056
  • dtw.distance_fast(x,y) distance: 1888.8583853746156
  • fastdtw(x,y) time: 0.23296659641047412
  • fastdtw(x,y) distance: 27238.0
  • pydtw.dtw(x,y) time: 0.13706478039556558
  • pydtw.dtw(x,y) distance: 17330.0

My question is: Why do I get different performances and different distances? Thank you very much for your comments.

// edit: The unit of the time measurements is seconds.

Hagbard
  • 3,430
  • 5
  • 28
  • 64

2 Answers2

5

Some additional information on top of Felipe Mello's informative answer (disclaimer: author of DTAIDistance here).

For the distance results:

  • DTAIDistance only uses Euclidean distance (or L2 norm), this is hardcoded. This choice was made to speed up the execution of the C-code (no function calls). The 'fast' refers to using the C-based implementation instead of a pure Python version and both methods thus give the exact same results.
  • FastDTW is a different algorithm than DTW. It is a linear approximation. The 'fast' refers to a lower complexity.
  • cDTW. I'm not very familiar with this toolbox but it seems to implement L1 norm.

For the speed results:

In general, pure C-based algorithms are ~100 times faster than pure Python ones (in DTAIDistance this is the difference between distance() and distance_fast()). For the C-based methods the differences are mainly due to flexibility of the methods. Passing a custom norm, for example, will slow down the method (more function calls). Also, different methods have different options which cause more or less switch statements in the algorithm. For example, DTAIDistance, offers quite a number of options to tune the method because it prefers early stopping the computation over further optimizations (also observed by Felipe Mello). Furthermore, different methods store different amounts of data. The DTAIDistance distance method does not store the entire matrix to also offer linear space complexity (the full matrix is obtained using the warping_paths method that has quadratic space complexity). In general for DTW it is recommended to use a window to reduce also the time complexity a bit.

For DTAIDistance, all the design choices were made with timeseries clustering applications in mind (the distance_matrix_fast method). This is another reason not to allow custom norms. The DTW code needs to be pure C to support parallelization on the level of C-code and have minimal overhead (it uses OpenMP) to compute all pairwise distances between series.

wannesm
  • 116
  • 1
  • 4
  • Are you saying that `dtaidistance.dtw.distance()`computes Euclidean distance and not DTW distance? – Philipp Dec 21 '20 at 13:35
  • I am asking because in my experiments, `fastdtw.fastdtw()` was able to more accurately calculate the distance between datapoints – Philipp Dec 21 '20 at 13:40
  • Hi @Philipp , it is DTW. Inside the DTW algorithm one also needs to compute the distance between two datapoints, thus dtw(x,y) uses a subroutine inner_dist(x[t], y[t]). This inner distance is L2 for dtaidistance while some other toolboxes use L1. Most papers I know use L2 and I also think it makes most sense. – wannesm Mar 03 '21 at 13:12
  • With respect to fastdtw: by definition fastdtw cannot be more accurate than dtw since the latter is an exact method. It can be, however, that for some use case-related reason fastdtw has certain properties that are beneficial for your particular task (e.g. clustering, classification). You might also be interested in recent work by Wu and Keogh that give some more background on why fastdtw (implementations) are often slower than regular dtw. – wannesm Mar 03 '21 at 13:17
4

Edit: what are the units of the time measurements? I believe that you compared them as they were all in the same unit. Probably the dtw.distance is, for example, in microseconds, while the other answers are in milliseconds, and you thought that dtw.distance performed slower, when it is actually the opposite.

There are different methodologies to measure the distance between two points. It could be based on standard deviation or just euclidian distance. Here is a list of many of those distance.

Some of them might be more computational intensive than others, and also have different meanings. Fast dtw, for example, uses as a third input the type of distance that you want, as described on their github

distance3, path3 = fastdtw(s1, s2, dist = euclidean)

Another reason for the speed difference is the underlying code. Some of them are in pure python, while others are in C, which can be easily 100x faster. A way to speed up your dtaidistance is to set a maximum distance threshold. The algorithm will stop the calculation if it realizes that the total distance will be above a certain value:

distance2 = dtw.distance_fast(array.array('d',s1),array.array('d',s2), max_dist = your_threshold)

It is also important to note that some might be optimized for longer or shorter arrays. Looking at the example below and running it in my computer, I find different results:

from cdtw import pydtw
from dtaidistance import dtw
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean
s1=np.array([1,2,3,4],dtype=np.double)
s2=np.array([4,3,2,1],dtype=np.double)

%timeit dtw.distance_fast(s1, s2)
4.1 µs ± 28.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit d2 = pydtw.dtw(s1,s2,pydtw.Settings(step = 'p0sym', window = 'palival', param = 2.0, norm = False, compute_path = True)).get_dist()
45.6 µs ± 3.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit d3,_=fastdtw(s1, s2, dist=euclidean)
901 µs ± 9.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

fastdtw is 219 times slower than dtaidistance lib and 20x slower than cdtw

Felipe Mello
  • 395
  • 2
  • 8
  • First of all: Sorry for the late reply and thank you very much for your answer. Adding the parameter "dist = euclidean" to the fastdtw call did not change the distance value. Are there any other additional parameters I'm missing? – Hagbard Oct 15 '18 at 07:25
  • @Hagbard, probably euclidian is the standard used distance, since it is one of the simplest and quickest, that's why you wouldn't notice any difference. – Felipe Mello Oct 16 '18 at 13:49
  • @Hagbard, i was thinking and I believe that you made a mistake when you were timing it. What are the units of the time? I am a 100% sure that it didn't take 22 seconds for the dtw.distance. I just created two random vector of 3000 points here, and it took 50ms. Probably it said 22, but it is a different time unit, for example, microseconds, instead of miliseconds, and you thought that it took longer than the others – Felipe Mello Oct 16 '18 at 13:56
  • This answer might come a bit as a surprise but the unit of the time measurements is indeed seconds. In order to verify that again I ran the above code with the two sequences s1 and s2 created with s1 = np.random.rand(3000) and s2 = np.random.rand(3000) and got similar results. Is there anything I could be doing wrong here? Concerning the distance metric: Which parameters should I use to obtain results of the same metric (e.g euclidean distance) for all function calls? – Hagbard Oct 17 '18 at 07:17
  • @Hagbard, I am not sure how you could have the same "distance metric" in everyone of them. If you find out on their documentation or something, I would also like to know. Now, concerning to the 22s it took you, could you do a last test? Could you please specify that the numpy array is of type np.double? just do: s1 = np.array(s1, dtype = np.double); s2 = np.array(s2, dtype = np.double) and run on console: %timeit dtw.distance_fast(s1, s2) – Felipe Mello Oct 18 '18 at 20:06
  • 2
    Thanks a lot for staying with me on this issue. Unfortunately, converting the input sequences to double arrays only results in (roughly) the same output. :-( – Hagbard Oct 22 '18 at 07:25