
Does anyone know a good method to calculate the empirical/sample covariogram, if possible in Python?

Here is a screenshot from a book that gives a good definition of the covariogram:

[screenshot: the book's definition of the covariogram]

If I understood it correctly, for a given lag/width h I'm supposed to find all pairs of points separated by h (or by less than h), multiply their values, and, for each of those points, compute its mean, which in this case is denoted m(x_i). However, according to the definition of m(x_i), if I want to compute m(x_1) I need the average of the values located within distance h of x_1. This looks like a very intensive computation.
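To make my reading concrete (this is my paraphrase, not the book's exact notation):

Cov(h) = (1 / |N(h)|) * sum over pairs (x_i, x_j) in N(h) of [ z(x_i)*z(x_j) - m(x_i)*m(x_j) ]

where N(h) is the set of pairs of locations separated by (roughly) h, and m(x_i) is the mean of the values located within distance h of x_i.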

First of all, am I understanding this correctly? If so, what is a good way to compute it in a two-dimensional space? I tried to code this in Python (using numpy and pandas), but it takes a couple of seconds and I'm not even sure it is correct, which is why I will refrain from posting that code here. Here is another attempt, a very naive implementation:

import numpy as np
from scipy.spatial.distance import pdist, squareform

distances = squareform(pdist(np.array(coordinates))) # coordinates is an n x 2 array
z = np.array(z) # z are the values
cutoff = np.max(distances)/3.0 # somewhat arbitrary cutoff
width = cutoff/15.0
widths = np.arange(0, cutoff + width, width)
Z = []
Cov = []

for w in np.arange(len(widths)-1): # for each width
    # for each pairwise distance
    for i in np.arange(distances.shape[0]): 
        for j in np.arange(distances.shape[1]): 
            if distances[i, j] <= widths[w+1] and distances[i, j] > widths[w]:
                m1 = []
                m2 = []
                # when a distance is within a given width, calculate the means of
                # the points involved
                for x in np.arange(distances.shape[1]):
                    if distances[i,x] <= widths[w+1] and distances[i, x] > widths[w]:
                        m1.append(z[x])

                for y in np.arange(distances.shape[1]):
                    if distances[j,y] <= widths[w+1] and distances[j, y] > widths[w]:
                        m2.append(z[y])

                mean_m1 = np.array(m1).mean() 
                mean_m2 = np.array(m2).mean()
                Z.append(z[i]*z[j] - mean_m1*mean_m2)
    Z_mean = np.array(Z).mean() # calculate covariogram for width w
    Cov.append(Z_mean) # collect covariances for all widths

However, I have now confirmed that there is an error in my code. I know this because I used the variogram to compute the covariogram (covariogram(h) = covariogram(0) - variogram(h)) and I get a different plot:

[plot: covariogram computed with the code above]

And it is supposed to look like this:

[plot: the expected covariogram, obtained via covariogram(h) = covariogram(0) - variogram(h)]
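For reference, the cross-check itself is just that relation applied bin by bin; a minimal sketch, where gamma_h is a placeholder for the binned empirical semivariogram values (computed elsewhere) and I take covariogram(0) to be the sample variance of z, which is the convention I'm assuming:

import numpy as np

# gamma_h: empirical semivariogram value per lag bin (hypothetical array)
# cov0: covariogram at lag 0, taken here as the sample variance of z
cov0 = np.var(z)
cov_from_variogram = cov0 - gamma_h  # covariogram(h) = covariogram(0) - variogram(h)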

Finally, if you know of a Python/R/MATLAB library that calculates empirical covariograms, let me know; at least that way I could verify what I did.

r_31415
  • As written, your equation for `m` doesn't make sense. If you sum over `i`, then it doesn't mean anything to index by `i` outside of the sum (ie, in `m(x_i)` ); that is, there's no `i` on the rhs. – tom10 May 29 '14 at 22:28
  • Right. I thought that was confusing too. If you read the reference I added (this is the exact page http://books.google.com/books?id=nqPgitP5MfYC&lpg=PA204&ots=jA7x0hwGIp&dq="empirical+covariogram"&pg=PA204&redir_esc=y#v=onepage&q=%22empirical%20covariogram%22&f=false) at the end, it seems the sum is over the points located within distance h, not over the same i. That is, N(h) in the first equation is not the same as N(h) in the second equation. As I said, the book has a better notation than Wikipedia. – r_31415 May 29 '14 at 22:49
  • I don't think this is as hard as you think, but without the right equation it's difficult to know. I don't exactly want to read a book chapter to figure out the right equation either. I'll wait a bit, but eventually I'll vote to close as unclear. Basically, though, you just want to sum over pairwise measures, which you can probably do easily enough in numpy. But first you need to know what you need to do. – tom10 May 29 '14 at 23:15
  • Well, I can try to change the equation. By the way, if you click the link in my previous comment, it leads to the equation. Let me know how I can improve the question, instead of voting to close. – r_31415 May 29 '14 at 23:19
  • Also, about how many points do you have? ie, does it make sense to calculate all interpoint distances? – tom10 May 29 '14 at 23:20
  • In the example, I think there are 150 points. However, it is supposed to be used with thousands of points. I'm not sure if it's possible to calculate a covariogram/variogram without taking into consideration every pairwise distance. – r_31415 May 29 '14 at 23:25
  • By the way, I already changed the equations. – r_31415 May 29 '14 at 23:28
  • OK. Thanks for the changes. Now it seems (to me) like a programming question (ie, how do I represent this idea in a program) rather than a math one, and it makes sense. – tom10 May 29 '14 at 23:49

1 Answer


One could use scipy.cov, but if one does the calculation directly (which is easy enough), there are ways to speed this up.

First, make some fake data that has some spatial correlation. I'll do this by first making a spatial map with the desired correlations and then generating random data points whose z-values are taken from that underlying map (see Edit 1 below for how the positions are chosen).

Edit 1:
I changed the data point generator so that the positions are purely random, but the z-values are proportional to the spatial map. And I changed the map so that the left and right sides are shifted relative to each other, to create negative correlation at large h.

from numpy import *
import math      # used for math.sqrt when building the pair list below
import random
import matplotlib.pyplot as plt

S = 1000
N = 900
# first, make some fake data, with correlations on two spatial scales
#     density map
x = linspace(0, 2*pi, S)
sx = sin(3*x)*sin(10*x)
density = .8* abs(outer(sx, sx))
density[:,:S//2] += .2
#     make a point cloud motivated by this density
random.seed(10)  # so this can be repeated
points = []
while len(points)<N:
    v, ix, iy = random.random(), random.randint(0,S-1), random.randint(0,S-1)
    if True: #v<density[ix,iy]:
        points.append([ix, iy, density[ix,iy]])
locations = array(points).transpose()
print(locations.shape)
plt.imshow(density, alpha=.3, origin='lower')
plt.plot(locations[1,:], locations[0,:], '.k')
plt.xlim((0,S))
plt.ylim((0,S))
plt.show()
#     build these into the main data: all pairs into distances and z0 z1 values
L = locations
m = array([[math.sqrt((L[0,i]-L[0,j])**2+(L[1,i]-L[1,j])**2), L[2,i], L[2,j]] 
                         for i in range(N) for j in range(N) if i>j])

Which gives:

[plot: the density map with the randomly placed sample points overlaid]

The above is just the simulated data, and I made no attempt to optimize its production. I assume this is where the OP starts, with the task below, since in a real situation the data already exists.
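As an aside, if building m with the nested Python loop ever becomes a bottleneck, here is a sketch of a faster route using scipy's pdist (reusing the locations array from above; np.triu_indices returns the i < j pairs in the same order as pdist's condensed output, so the pairs simply come out in the opposite order to the i > j loop, which doesn't matter for the covariogram):

import numpy as np
from scipy.spatial.distance import pdist

# condensed pairwise distances over the (x, y) coordinates
d = pdist(locations[:2, :].T)
# the corresponding (i, j) index pairs with i < j, in the same order as d
iu, ju = np.triu_indices(locations.shape[1], k=1)
# same columns as m above: [distance, z_i, z_j]
m_fast = np.column_stack([d, locations[2, iu], locations[2, ju]])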


Now calculate the "covariogram" (which is much easier than generating the fake data, btw). The idea here is to sort all the pairs and associated values by h, and then index into these using ihvals. That is, summing up to index ihval is the sum over N(h) in the equation, since this includes all pairs with separations below the desired value.

Edit 2:
As suggested in the comments below, N(h) is now only the pairs whose separation is between h-dh and h, rather than all pairs with separation between 0 and h (where dh is the spacing of the h-values in hvals -- ie, roughly S/1000 below).

# now do the real calculations for the covariogram
#    sort by h and give clear names
i = argsort(m[:,0])  # h sorting
h = m[i,0]
zh = m[i,1]
zsh = m[i,2]
zz = zh*zsh

hvals = linspace(0,S,1000)  # the values of h to use (S should be in the units of distance, here I just used ints)
ihvals = searchsorted(h, hvals)
result = []
for k, ihval in enumerate(ihvals[1:]):
    start, stop = ihvals[k], ihval  # only the pairs whose h falls in this bin
    N = stop-start
    if N>0:
        mnh = sum(zh[start:stop])/N
        mph = sum(zsh[start:stop])/N
        szz = sum(zz[start:stop])/N
        C = szz-mnh*mph
        result.append([h[ihval], C])
result = array(result)
plt.plot(result[:,0], result[:,1])
plt.grid()
plt.show()

[plot: empirical covariogram as a function of h]

which looks reasonable to me, as one can see bumps and troughs at the expected h values, but I haven't done a careful check.

The main speedup here over scipy.cov is that one can precalculate all of the products, zz. Otherwise, one would feed zh and zsh into cov for every new h, and all the products would be recalculated. The calculation could be sped up even more by doing partial sums, ie, from ihvals[n-1] to ihvals[n] at each step n, but I doubt that will be necessary.
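For comparison, here is a sketch of what the per-bin calculation built on numpy's cov would look like (recomputing the products in every bin, as described above); result_cov is my own name, bias=True matches the 1/N normalization used in the loop, and the [0, 1] entry of the 2x2 matrix returned by np.cov is the covariance between the two inputs:

import numpy as np

# per-bin covariance via np.cov, recomputing the products for every bin
result_cov = []
for start, stop in zip(ihvals[:-1], ihvals[1:]):
    if stop - start > 0:
        c = np.cov(zh[start:stop], zsh[start:stop], bias=True)[0, 1]
        result_cov.append([h[stop - 1], c])
result_cov = np.array(result_cov)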

tom10
  • Thanks! I will take a look at this in a moment. – r_31415 Jun 05 '14 at 02:00
  • +1. Thank you for your answer. You seem to be interpreting m_+h and m_-h as the average value below a distance h. Maybe you're right but I interpreted it as the average value within distance h for each value inside the sum in the definition of Cov(h). Obviously, my interpretation requires a much more intensive calculation. Do you know some reference where I can confirm your interpretation? – r_31415 Jun 05 '14 at 06:27
  • Another issue is the definition of N(h). I think the correct definition is the set of pairs within distance h, but not including previous distances. Otherwise, the covariogram is almost always zero as we increase the lag because N(h), at that point, comprises almost every pair of points. Furthermore, several implementations I have seen include a threshold value, in order to capture pairs between h-e and h+e, where e is a tolerance value. – r_31415 Jun 05 '14 at 06:28
  • @Robert: I think you're right about N(h), using the definition, it's not cumulative as I have it, so sums should be over zh[ihval-1:ihval], where your "tolerance value" is this span of h. (And doesn't this issue also answer the difference in m and the "intensive calculation", or is there something else that you think is different?) I'll try to change my answer when I get a chance. – tom10 Jun 05 '14 at 14:01
  • Sure, no problem. As for the difference in m, I think it can still be interpreted as you did, as an average over pairs of values separated by distance h (given a tolerance value), or as an average over each value considered in the sum of Cov(h). – r_31415 Jun 06 '14 at 18:56
  • @RobertSmith: I've made the edits, including a distribution that should give negative correlation at large h. I still don't understand what other interpretation you refer to for `m` though. I think mine is reasonable though since it's similar to how the mean usually falls out of the variance calculation. (Also, though, if you have the full derivation you should be able to check for the correct interpretation, which should be based on the math rather than the words surrounding it, or you could ask at the stats stack exchange.) – tom10 Jun 08 '14 at 18:30
  • Thank you. After looking at the [definition](http://mathworld.wolfram.com/Covariance.html) I think your interpretation is correct because the terms refer to the mean of the random variable. Therefore, the sample mean should be used as an approximation (assuming ergodicity, I suppose.) – r_31415 Jun 09 '14 at 03:02