I was going through the SciPy code for the two-sample KS test, which calculates the maximum distance between the CDFs of two given samples. Below is the code that calculates the cumulative distribution function (CDF).

I fail to understand the logic in the lines that calculate the CDFs. First, data1 and data2 are sorted, and then np.searchsorted is used to find the position of each element of data_all in both data1 and data2. data_all is simply the concatenation of the sorted data1 and data2.

What if the minimum value of data2 is below that of data1? Doesn't that violate the requirement that a CDF must not decrease as the value increases?

data_all = np.concatenate([data1,data2])
cdf1 = np.searchsorted(data1,data_all,side='right')/(1.0*n1)
cdf2 = (np.searchsorted(data2,data_all,side='right'))/(1.0*n2)
RTM

1 Answer

It's true that data_all is not sorted in general, but this does not matter for the computation.

  • The array cdf1 holds the values of the CDF of the first sample, computed at each of the points data_all
  • The array cdf2 holds the values of the CDF of the second sample, computed at each of the points data_all

Then the code does

np.max(np.absolute(cdf1 - cdf2))

which takes the maximum of the absolute differences between them. When you take the maximum of a set of numbers, the order in which you look at them does not matter.

So, the order of these two arrays does not matter, as long as it's consistent: cdf1[42] is the value of CDF1 at some point and cdf2[42] is the value of CDF2 at the same point.
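A quick way to convince yourself is the sketch below. It is not the SciPy source: the sample data, the seed, and the helper name ks_distance are made up for illustration. It evaluates both empirical CDFs at the same points in three different orders and checks that the maximum difference is unchanged, and that cdf1 is non-decreasing once the evaluation points are put in sorted order.

```python
import numpy as np

# Illustrative samples (assumed data, not from the question).
rng = np.random.default_rng(0)
data1 = np.sort(rng.normal(0.0, 1.0, size=50))
data2 = np.sort(rng.normal(0.5, 1.0, size=60))
n1, n2 = len(data1), len(data2)

def ks_distance(points):
    """Max |CDF1 - CDF2| with both empirical CDFs evaluated at `points`."""
    # searchsorted(..., side='right') counts how many sample values are <= each point,
    # so dividing by the sample size gives the empirical CDF at that point.
    cdf1 = np.searchsorted(data1, points, side='right') / n1
    cdf2 = np.searchsorted(data2, points, side='right') / n2
    return np.max(np.absolute(cdf1 - cdf2))

data_all = np.concatenate([data1, data2])            # not sorted in general
d_unsorted = ks_distance(data_all)
d_sorted = ks_distance(np.sort(data_all))            # same points, sorted
d_shuffled = ks_distance(rng.permutation(data_all))  # same points, random order

# The CDF itself is still non-decreasing in the value of the point:
order = np.argsort(data_all)
cdf1_by_value = np.searchsorted(data1, data_all[order], side='right') / n1
is_monotone = bool(np.all(np.diff(cdf1_by_value) >= 0))

print(d_unsorted == d_sorted == d_shuffled, is_monotone)
```

The entries of cdf1 only look non-monotone when read in the order of data_all; each entry is still the correct CDF value at its own point.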

  • While computing the maximum absolute difference between cdf1 and cdf2, if there is some overlap between the modes of the histograms of data1 and data2, wouldn't that affect the final score (the maximum difference)? – RTM Aug 03 '18 at 01:48