Implementing a 2-sample KS test with 3D data in Python

Question

I have two samples drawn from 3D distributions, and I want to run a Kolmogorov–Smirnov test on them to measure their similarity. scipy.stats has a 2-sample K-S test for 1 dimension (scipy.stats.ks_2samp), and I found an implementation for 2 dimensions, but none for 3 dimensions (or N dimensions).

Can someone implement a 2-sample K-S test for 3D distributions?

Possible duplicate of [Two-sample Kolmogorov-Smirnov Test in Python Scipy](http://stackoverflow.com/questions/10884668/two-sample-kolmogorov-smirnov-test-in-python-scipy) — Serenity, May 26 '16 at 22:53
Hi Stanley, that question refers to using scipy.stats.ks_2samp(), and I've linked to that function in my question. That function only handles 1D distributions. My question is about 3D distributions, which I feel I've made abundantly clear in the title and 3 mentions in the body-- can you please remove the duplicate flag? — crypdick, May 26 '16 at 23:46
@roving Flags will automatically be invalidated after 20 or so days. — Bhargav Rao, May 27 '16 at 13:13

score 1 · Accepted Answer · answered May 27 '16 at 22:37

The KS test is not easily generalized to multiple dimensions; see the Wikipedia article on the KS test on that point. Even if you can find or create a suitable generalization, I wonder if you really want to do that, as significance testing is generally not very informative on large data sets, where even tiny differences come out as significant.

If you want to quantify the difference between distributions, my advice is to consider entropy-based quantities such as mutual information or the Kullback-Leibler divergence.
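For illustration, here is a minimal sketch of a histogram-based Kullback-Leibler estimate for two 3D samples. The function name `kl_divergence_3d`, the bin count, and the pseudocount are assumptions made for this example and would need tuning for real data; the only library calls used are `np.histogramdd` and `scipy.stats.entropy`.

```python
import numpy as np
from scipy.stats import entropy


def kl_divergence_3d(sample_p, sample_q, bins=8, pseudocount=1e-9):
    """Histogram-based estimate of D_KL(P || Q) for two (n, 3) samples.

    `bins` and `pseudocount` are illustrative defaults, not recommendations.
    """
    sample_p = np.asarray(sample_p, dtype=float)
    sample_q = np.asarray(sample_q, dtype=float)

    # Use shared bin edges so both histograms are defined on the same cells.
    both = np.vstack([sample_p, sample_q])
    ranges = list(zip(both.min(axis=0), both.max(axis=0)))
    hist_p, _ = np.histogramdd(sample_p, bins=bins, range=ranges)
    hist_q, _ = np.histogramdd(sample_q, bins=bins, range=ranges)

    # A small pseudocount keeps empty cells in Q from giving an infinite result.
    p = hist_p.ravel() + pseudocount
    q = hist_q.ravel() + pseudocount

    # entropy(p, q) normalizes both inputs and returns sum(p * log(p / q)).
    return float(entropy(p, q))
```

The value depends strongly on the binning, and sparsely populated cells can dominate the sum, so it is worth checking how stable the score is across a few choices of `bins`.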

Maybe you can say more about what your goals are here.

Robert Dodier


Sure! I have some insect flight data inside a windtunnel. I have a dynamical model of their flight. To fit my model's parameters, I'm running an optimization algorithm, which simulates an ensemble of trajectories and scores them against the observed flight ensembles. – crypdick May 27 '16 at 22:59
To score each ensemble, I took the components of each kinematic (velocity x,y,z, accels xyz, positions xyz, curvature) and computed the Kullback-Leibler divergence. This gave me problems with the DKL of curvature (the curvatures are distributed with a long tail, and the DKL penalizes if the bins in the tails aren't exactly the same). That's why I decided to switch to a KS test, which got me around the binning issues. A 3D KS test would give me a statistical score telling me whether an ensemble of trajectories is distributed in the same way as a reference ensemble. – crypdick May 27 '16 at 23:05
I forgot to mention, for my final score, I simply took the sum of all my DKLs sum(DKL(velocity_xcomponent_targ, velocity_xcomponent_ref) + DKL(velocity_ycomponent_targ, velocity_ycomponent_ref) + ...) or more recently, sum(ks_2samp(kinematic components)) – crypdick May 27 '16 at 23:09
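The summed per-component KS score described in the comment above might look roughly like the sketch below. The function name `summed_ks_score` and the column layout (one kinematic component per column) are assumptions for this example. `scipy.stats.ks_2samp` returns a statistic and a p-value, so this sketch sums only the statistic.

```python
import numpy as np
from scipy.stats import ks_2samp


def summed_ks_score(target, reference):
    """Sum of 1D two-sample KS statistics over matching columns.

    `target` is (n, d) and `reference` is (m, d); each column holds one
    kinematic component (hypothetical layout: velocity x, velocity y, ...).
    """
    target = np.asarray(target, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if target.shape[1] != reference.shape[1]:
        raise ValueError("both samples need the same number of columns")

    # Compare each component separately and add up the KS statistics.
    return float(sum(
        ks_2samp(target[:, j], reference[:, j]).statistic
        for j in range(target.shape[1])
    ))
```

A score built this way only compares marginal distributions: two ensembles with the same marginals but different dependence between components would score as identical.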
Since you are modeling flights, surely the trajectories matter and not just the distribution of insects at a point in time. Perhaps then an appropriate goodness of fit is the distance of simulated trajectory from an actual trajectory with the same initial conditions. – Robert Dodier May 28 '16 at 16:53
To clarify a little, we're not interested in simulating trajectories that exactly retrace any given individual flight. We're interested in how the insects change their navigational strategy in the presence of various stimuli, and see if we can make any claims about which characteristics of the plume stimulus they respond to. We have a model of their "baseline random walk", and after I fit that I later want to add stimulus-triggered decision policies to see which ones best explain differences between control and experimental conditions. – crypdick May 28 '16 at 17:28
Two things: 1) unlike many insects, they aren't particularly robust fliers, and don't exhibit robust discrete behaviors (crosswind casting, surging upwind, etc.). They fly like they're drunk. They just seem to bias their "random walk" towards the sources. 2) In all conditions, even controls, they have a strong bias to fly along certain parts of the tunnel. My "baseline random walk" model has a parameter which captures that behavior, and I can't see how to fit the parameter without the score function scoring position distributions. – crypdick May 28 '16 at 17:32

OK, that makes sense, I see why you are focusing on distributions. Note that KS compares measures, so the generalization you need is to compute the measure (i.e., total mass) of arbitrary subsets and look for the maximum difference. Probably for your purposes it's enough to consider only spheres, ellipsoids, or boxes or something like that. I don't think this can be reduced to repeated 1-d problems. You might end up writing some code for this, but if not, at least this gives an idea of what to look for. I'll try to spell it out in greater detail if it's interesting. – Robert Dodier May 28 '16 at 18:28
Robert, I'm not sure what you're suggesting. Divide the windtunnel into smaller volumes and then use ks_2samp? – crypdick Aug 09 '16 at 17:02
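One possible reading of the "compare the mass of boxes" idea, sketched under stated assumptions: take every data point as an origin, split space into the 8 octants around it (unbounded boxes), and record the largest difference between the fractions of the two samples that fall in any octant. The function names here are hypothetical. This is similar in spirit to the octant-counting approach of Fasano and Franceschini but is not a faithful implementation of it, and it returns only a statistic, not a p-value.

```python
import numpy as np


def orthant_fractions(points, origins):
    """Fraction of `points` in each of the 2**d orthants around each origin.

    Returns an array of shape (len(origins), 2**d).
    """
    d = points.shape[1]
    # greater[i, k, j] is True when point i exceeds origin k in coordinate j.
    greater = points[:, None, :] > origins[None, :, :]
    # Encode each pattern of comparisons as an integer label in [0, 2**d).
    weights = 2 ** np.arange(d)
    labels = (greater * weights).sum(axis=2)  # shape (n_points, n_origins)
    return np.stack(
        [(labels == q).mean(axis=0) for q in range(2 ** d)], axis=1
    )


def orthant_ks_statistic(sample_a, sample_b):
    """Largest orthant-mass difference between two (n, d) samples.

    Origins are all points of both samples. Memory use grows with
    n_points * n_origins * d, so subsample the origins for large data.
    """
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    origins = np.vstack([a, b])
    diff = np.abs(orthant_fractions(a, origins) - orthant_fractions(b, origins))
    return float(diff.max())
```

In 1 dimension the two "orthants" are the half-lines on either side of the origin, so the statistic reduces to the usual two-sample KS distance between empirical CDFs. The 1D KS null distribution does not carry over to 3D, though; if a p-value is needed, one option is to recompute the statistic after repeatedly shuffling the pooled points between the two groups.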
