2

I have a set of 46 years worth of rainfall data. It's in the form of 46 numpy arrays each with a shape of 145, 192, so each year is a different array of maximum rainfall data at each lat and lon coordinate in the given model.

I need to create a global map of tau values by doing an M-K test (Mann-Kendall) for each coordinate over the 46 years.

I'm still learning python, so I've been having trouble finding a way to go through all the data in a simple way that doesn't involve me making 27840 new arrays for each coordinate.

So far I've looked into how to use scipy.stats.kendalltau and using the definition from here: https://github.com/mps9506/Mann-Kendall-Trend

EDIT:

To clarify and add a little more detail, I need to perform a test for each coordinate, not just on each file individually. For example, for the first M-K test, I would want my x=46 and I would want y=data1[0,0],data2[0,0],data3[0,0]...data46[0,0]. I would then repeat this process for every single coordinate in each array. In total the M-K test would be done 27840 times and leave me with 27840 tau values that I can then plot on a global map.

EDIT 2:

I'm now running into a different problem. Going off of the suggested code, I have the following:

for i in range(145):
    for j in range(192):
        out[i,j] = mk_test(yrmax[:,i,j],alpha=0.05)
print out

I used numpy.stack to stack all 46 arrays into a single array (yrmax) with shape (46L, 145L, 192L). I've tested it out and it calculates p and tau correctly if I change the code from out[i,j] to just out. However, doing this messes up the for loop so it only takes the results from the last coordinate instead of all of them. And if I leave the code as it is above, I get the error: TypeError: list indices must be integers, not tuple

My first guess was that it has to do with mk_test and how the information is supposed to be returned in the definition. So I've tried altering the code from the link above to change how the data is returned, but I keep getting errors relating back to tuples. So now I'm not sure where it's going wrong and how to fix it.

EDIT 3:

One more clarification I thought I should add. I've already modified the definition in the link so it returns only the two number values I want for creating maps, p and z.

Alex Morrison
  • I don't really know the science here but it's possible you might just need to throw hardware at the problem. That said, chances are there are ways you can avoid having all that data in memory simultaneously by breaking the problem up into smaller chunks and then combining those chunks. – Iguananaut Oct 20 '17 at 19:58

4 Answers

1

I don't think this is as big an ask as you may imagine. From your description it sounds like you don't actually want the scipy kendalltau, but the function in the repository you posted. Here is a little example I set up:

from time import time

import numpy as np
from mk_test import mk_test

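# 46 years of random stand-in data, stacked into shape (46, 145, 192)
data = np.array([np.random.rand(145, 192) for _ in range(46)])
# object array: one mk_test result per coordinate
mk_res = np.empty((145, 192), dtype=object)
start = time()
for i in range(145):
    for j in range(192):
        # data[:, i, j] is the 46-year series at coordinate (i, j)
        mk_res[i, j] = mk_test(data[:, i, j], alpha=0.05)
print(f'Elapsed Time: {time() - start} s')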

Elapsed Time: 35.21990394592285 s

My system is a MacBook Pro with a 2.7 GHz Intel Core i7 and 16 GB of RAM, so nothing special.

Each entry in the mk_res array (shape 145, 192) corresponds to one of your coordinate points and contains an entry like so:

array(['no trend', 'False', '0.894546014835', '0.132554125342'], dtype='<U14')

One thing that might be useful would be to modify the code in mk_test.py to return all numerical values. So instead of 'no trend'/'positive'/'negative' you could return 0/1/-1, and 1/0 for True/False and then you wouldn't have to worry about the whole object array type. I don't know what kind of analysis you might want to do downstream but I imagine that would preemptively circumvent any headaches.
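A minimal sketch of that conversion, assuming each result is a 4-item sequence like the one shown above (trend label, significance flag, p, z). The helper name `to_numeric` and the label strings in `TREND_CODES` are assumptions for illustration; check them against what your copy of mk_test.py actually returns.

```python
import numpy as np

# Assumed label strings; adjust to match the labels your mk_test returns.
TREND_CODES = {"no trend": 0, "positive": 1, "increasing": 1,
               "negative": -1, "decreasing": -1}

def to_numeric(result):
    """Convert one (trend, h, p, z) result into an array of four floats."""
    trend, h, p, z = result
    # str() handles both real booleans and the 'True'/'False' strings that
    # appear once the result has been coerced into a string array.
    significant = 1.0 if str(h) == "True" else 0.0
    return np.array([TREND_CODES[str(trend)], significant, float(p), float(z)])

example = np.array(['no trend', 'False', '0.894546014835', '0.132554125342'])
numeric = to_numeric(example)
print(numeric)
```

With every result converted this way, the output can be a plain float array of shape (145, 192, 4) instead of an object array.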

Grr
  • I took your code and I ended up getting the following error: `TypeErrorTraceback (most recent call last) in () 9 for i in range(145): 10 for j in range(192): ---> 11 out[i,j] = mk_test(data[:, i, j], alpha=0.05) 12 #print(f'Elapsed Time: {time() - start} s') 13 TypeError: 'tuple' object does not support item assignment` I did modify your code a little bit to fit mine, and got the same error. When you remove `[i,j]` from out[i,j] it gets rid of the error, but doesn't make an array of the full data. – Alex Morrison Oct 20 '17 at 22:20
  • It sounds like the way you built the `out` array is giving you a tuple instead of an array/list. Can you post the modified code? – Grr Oct 24 '17 at 15:09
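Both TypeErrors quoted in this thread are what Python raises when `out` is a plain list or tuple and is indexed with `[i, j]`. The sketch below reproduces them and shows a preallocated NumPy array working instead. `fake_test` is a hypothetical stand-in for a modified mk_test that returns two numbers; the small array shape is only there to keep the example quick.

```python
import numpy as np

def fake_test(series):
    """Hypothetical stand-in for mk_test: returns two numbers per series."""
    return float(series.mean()), float(series.std())

yrmax = np.random.rand(46, 3, 4)   # small stand-in for the (46, 145, 192) stack

# A list or a tuple cannot be assigned to with [i, j]; both raise TypeError.
errors = []
for bad_container in ([], ()):
    try:
        bad_container[0, 0] = fake_test(yrmax[:, 0, 0])
    except TypeError as exc:
        errors.append(str(exc))
print(errors)

# Preallocating a NumPy array with a trailing axis of length 2 works.
out = np.empty(yrmax.shape[1:] + (2,))
for i in range(yrmax.shape[1]):
    for j in range(yrmax.shape[2]):
        out[i, j] = fake_test(yrmax[:, i, j])
print(out.shape)
```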
1

Thanks to the answers provided and some work I was able to work out a solution that I'll provide here for anyone else that needs to use the Mann-Kendall test for data analysis.

The first thing I needed to do was flatten the original array I had into a 1D array. I know there is probably an easier way to go about doing this, but I ultimately used the following code based on code Grr suggested using.

x = 46
out1 = np.empty(x)
out = np.empty((0))
for i in range(145):
    for j in range(192):
        out1 = yrmax[:,i,j]
        out = np.append(out, out1, axis=0)

Then I reshaped the resulting array (out) as follows:

out2 = np.reshape(out,(27840,46))

I did this so my data would be in a format compatible with scipy.stats.kendalltau. Here 27840 is the number of grid points that will be on my map (i.e. it's just 145*192) and 46 is the number of years the data spans.
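The flatten-and-reshape steps can also be written as a single reshape plus transpose. The sketch below, which uses a small random array as a stand-in for the real stack, checks that both routes give the same (points, years) table.

```python
import numpy as np

years, nlat, nlon = 46, 5, 7          # small stand-in for (46, 145, 192)
yrmax = np.random.rand(years, nlat, nlon)

# Loop version, as in the answer: append one coordinate's series at a time.
out = np.empty(0)
for i in range(nlat):
    for j in range(nlon):
        out = np.append(out, yrmax[:, i, j], axis=0)
out2_loop = out.reshape(nlat * nlon, years)

# Equivalent without the loop: merge the two spatial axes, then transpose.
out2_fast = yrmax.reshape(years, nlat * nlon).T

print(np.array_equal(out2_loop, out2_fast))
```

The loop version reallocates `out` on every `np.append` call, so the reshape route should be noticeably faster on the full grid.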

I then used the following loop, which I modified from Grr's code, to find Kendall's tau and its respective p-value at each latitude and longitude over the 46-year period.

x = range(46)
y = np.zeros((0))
for j in range(27840):
    b = sc.stats.kendalltau(x,out2[j,:])
    y = np.append(y, b, axis=0)

Finally, I reshaped the data one more time as shown: newdata = np.reshape(y,(145,192,2)) so the final array is in a suitable format to be used to create a global map of both tau and p-values.
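For comparison, here is a sketch that writes tau and p directly into preallocated (lat, lon) arrays, which avoids the flatten, append, and final reshape steps. It assumes the stacked array has shape (years, lat, lon) and uses a small random array in place of the real data.

```python
import numpy as np
from scipy.stats import kendalltau

years, nlat, nlon = 46, 4, 6            # small stand-in for (46, 145, 192)
rng = np.random.default_rng(0)
yrmax = rng.random((years, nlat, nlon))

t = np.arange(years)                    # the time index, 0..45
tau = np.empty((nlat, nlon))
pval = np.empty((nlat, nlon))
for i in range(nlat):
    for j in range(nlon):
        # kendalltau returns (statistic, p-value) for the two sequences
        tau[i, j], pval[i, j] = kendalltau(t, yrmax[:, i, j])

print(tau.shape, pval.shape)
```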

Thanks everyone for the assistance!

Alex Morrison
0

Depending on your situation, it might just be easiest to make the arrays.

You won't really need them all in memory at once (not that it sounds like a terrible amount of data). Something like this only has to deal with one "copied out" coordinate trend at once:

SIZE = (145,192)
year_matrices = load_years() # list with one 145x192 array per year
result_matrix = numpy.zeros(SIZE)
for x in range(SIZE[0]):
    for y in range(SIZE[1]):
        coord_trend = map(lambda d: d[x][y], year_matrices)
        result_matrix[x][y] = analyze_trend(coord_trend)
print result_matrix

Now, there are things like itertools.izip that could help you if you really want to avoid actually copying the data.

Here's a concrete example of how Python's "zip" might work with data like yours (although as if you'd used ndarray.flatten on each year):

year_arrays = [
    ['y0_coord0_val', 'y0_coord1_val', 'y0_coord2_val', 'y0_coord3_val'],
    ['y1_coord0_val', 'y1_coord1_val', 'y1_coord2_val', 'y1_coord3_val'],
    ['y2_coord0_val', 'y2_coord1_val', 'y2_coord2_val', 'y2_coord3_val'],
]
assert len(year_arrays) == 3
assert len(year_arrays[0]) == 4


# list() keeps this working on Python 3, where zip returns a lazy iterator
coord_arrays = list(zip(*year_arrays))      # i.e. `zip(year_arrays[0], year_arrays[1], year_arrays[2])`


# original data is essentially transposed
assert len(coord_arrays) == 4
assert len(coord_arrays[0]) == 3
assert coord_arrays[0] == ('y0_coord0_val', 'y1_coord0_val', 'y2_coord0_val')
assert coord_arrays[1] == ('y0_coord1_val', 'y1_coord1_val', 'y2_coord1_val')
assert coord_arrays[2] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val')
assert coord_arrays[3] == ('y0_coord3_val', 'y1_coord3_val', 'y2_coord3_val')

flat_result = map(analyze_trend, coord_arrays)

The example above still copies the data (and all at once, rather than a coordinate at a time!) but hopefully shows what's going on.

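Now, if you replace zip with itertools.izip and map with itertools.imap then the copies needn't occur; itertools wraps the original arrays and keeps track internally of where it should be fetching values from. (In Python 3 the built-in zip and map are already lazy, so the itertools versions are no longer needed.)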
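A small sketch of the lazy version, assuming Python 3, where the built-in `zip` and `map` already behave like Python 2's lazy itertools variants. `analyze_trend` here is a hypothetical placeholder, and the numbers are made up so the result is easy to follow.

```python
year_arrays = [
    [1, 2, 3, 4],     # year 0: one value per coordinate
    [2, 4, 6, 8],     # year 1
    [3, 6, 9, 12],    # year 2
]

def analyze_trend(series):
    """Hypothetical placeholder: last value minus first value."""
    return series[-1] - series[0]

coord_iter = zip(*year_arrays)             # lazy: no per-coordinate tuples built yet
results = map(analyze_trend, coord_iter)   # also lazy

first = next(results)    # only now is the first coordinate's series assembled
rest = list(results)     # consume the remaining coordinates
print(first, rest)
```

Only one per-coordinate tuple exists at a time while the iterator is being consumed, which is the point made above about avoiding a full transposed copy.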

There's a catch, though: to take advantage of itertools you need to access the data only sequentially (i.e. through iteration). In your case, it looks like the code at https://github.com/mps9506/Mann-Kendall-Trend/blob/master/mk_test.py might not be compatible with that. (I haven't reviewed the algorithm itself to see if it could be.)

Also please note that in the example I've glossed over the numpy ndarray stuff and just shown flat coordinate arrays. It looks like numpy has some of its own options for handling this instead of itertools, e.g. this answer says "Taking the transpose of an array does not make a copy". Your question was somewhat general, so I've tried to give some general tips as to ways one might deal with larger data in Python.
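To illustrate the quoted point about transposes, here is a short sketch assuming the yearly arrays have already been stacked into one (time, lat, lon) array. The random data is only a stand-in.

```python
import numpy as np

yrmax = np.random.rand(46, 145, 192)      # (time, lat, lon) stand-in data
by_coord = yrmax.transpose(1, 2, 0)       # (lat, lon, time) view of the same data

# The transposed array shares its buffer with the original.
print(np.shares_memory(yrmax, by_coord))

series = by_coord[10, 20]                 # the 46-value series at one grid cell
print(series.shape)
```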

natevw
0

I ran into the same task and have managed to come up with a vectorized solution using numpy and scipy.

The formulas are the same as on this page: https://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm.

The trickiest part is to work out the adjustment for the tied values. I modified the code as in this answer to compute the number of tied values for each record, in a vectorized manner.
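For cross-checking, here is a plain single-series sketch of the same quantities: the S statistic, the tie-corrected variance, the continuity-corrected z, and a two-sided p-value. It is not the vectorized code from this answer; the function name `mann_kendall_1d` is made up for illustration, and ties are counted with `np.unique` rather than the sort-and-diff trick.

```python
import math
import numpy as np

def mann_kendall_1d(x):
    """Return (S, Var(S), z, two-sided p) for one series."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S: sum of sign(x_j - x_i) over all pairs with j > i.
    diffs = x[None, :] - x[:, None]
    s = int(np.sign(diffs[np.triu_indices(n, k=1)]).sum())
    # Tie correction: t holds the size of each group of equal values
    # (groups of size 1 contribute zero to the sum).
    _, t = np.unique(x, return_counts=True)
    tie_term = int((t * (t - 1) * (2 * t + 5)).sum())
    var = (n * (n - 1) * (2 * n + 5) - tie_term) / 18.0
    # Continuity-corrected standard score; zero when S is zero.
    z = 0.0 if s == 0 or var == 0 else (s - np.sign(s)) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return s, var, z, p

print(mann_kendall_1d([1, 2, 2, 3]))
```

Running this on a handful of grid cells and comparing against the vectorized output is a cheap way to confirm the tie handling.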

Below are the 2 functions:

import copy
import numpy as np
from scipy.stats import norm

def countTies(x):
    '''Count number of ties in rows of a 2D matrix

    Args:
        x (ndarray): 2d matrix.
    Returns:
        result (ndarray): 2d matrix with same shape as <x>. In each
            row, the tie counts are inserted at (not really) arbitrary
            locations.
            The locations of the tie counts are not important, since
            they will be subsequently put into a formula of sum(t*(t-1)*(2t+5)).
    
    Inspired by: https://stackoverflow.com/a/24892274/2005415.
    '''
    if np.ndim(x) != 2:
        raise Exception("<x> should be 2D.")

    m, n = x.shape
    pad0 = np.zeros([m, 1]).astype('int')

    x = copy.deepcopy(x)
    x.sort(axis=1)
    diff = np.diff(x, axis=1)

    cated = np.concatenate([pad0, np.where(diff==0, 1, 0), pad0], axis=1)
    absdiff = np.abs(np.diff(cated, axis=1))

    rows, cols = np.where(absdiff==1)
    rows = rows.reshape(-1, 2)[:, 0]
    cols = cols.reshape(-1, 2)
    counts = np.diff(cols, axis=1)+1
    result = np.zeros(x.shape).astype('int')
    result[rows, cols[:,1]] = counts.flatten()

    return result

def MannKendallTrend2D(data, tails=2, axis=0, verbose=True):
    '''Vectorized Mann-Kendall tests on 2D matrix rows/columns

    Args:
        data (ndarray): 2d array with shape (m, n).
    Keyword Args:
        tails (int): 1 for 1-tail, 2 for 2-tail test.
        axis (int): 0: test trend in each column. 1: test trend in each
            row.
    Returns:
        z (ndarray): If <axis> = 0, 1d array with length <n>, standard scores
            corresponding to data in each column in <data>.
            If <axis> = 1, 1d array with length <m>, standard scores
            corresponding to data in each row in <data>.
        p (ndarray): p-values corresponding to <z>.
    '''

    if np.ndim(data) != 2:
        raise Exception("<data> should be 2D.")

    # always put records in rows and do M-K test on each row
    if axis == 0:
        data = data.T

    m, n = data.shape
    mask = np.triu(np.ones([n, n])).astype('int')
    mask = np.repeat(mask[None,...], m, axis=0)
    s = np.sign(data[:,None,:]-data[:,:,None]).astype('int')
    s = (s * mask).sum(axis=(1,2))

    #--------------------Count ties--------------------
    counts = countTies(data)
    tt = counts * (counts - 1) * (2*counts + 5)
    tt = tt.sum(axis=1)

    #-----------------Sample Gaussian-----------------
    var = (n * (n-1) * (2*n+5) - tt) / 18.
    eps = 1e-8  # avoid dividing 0
    z = (s - np.sign(s)) / (np.sqrt(var) + eps)
    p = norm.cdf(z)
    p = np.where(p>0.5, 1-p, p)

    if tails==2:
        p=p*2

    return z, p

I assume your data come in the layout of (time, latitude, longitude), and you are examining the temporal trend for each lat/lon cell.

To simulate this task, I synthesized a sample data array of shape (50, 145, 192). The 50 time points are taken from Example 5.9 of the book Wilks 2011, Statistical methods in the atmospheric sciences. And then I simply duplicated the same time series 27840 times to make it (50, 145, 192).

Below is the computation:

x = np.array([0.44,1.18,2.69,2.08,3.66,1.72,2.82,0.72,1.46,1.30,1.35,0.54,\
        2.74,1.13,2.50,1.72,2.27,2.82,1.98,2.44,2.53,2.00,1.12,2.13,1.36,\
        4.9,2.94,1.75,1.69,1.88,1.31,1.76,2.17,2.38,1.16,1.39,1.36,\
        1.03,1.11,1.35,1.44,1.84,1.69,3.,1.36,6.37,4.55,0.52,0.87,1.51])

# create a big cube with shape: (T, Y, X)
arr = np.zeros([len(x), 145, 192])
for i in range(arr.shape[1]):
    for j in range(arr.shape[2]):
        arr[:, i, j] = x

print(arr.shape)
# re-arrange into tabular layout: (Y*X, T)
arr = np.transpose(arr, [1, 2, 0])
arr = arr.reshape(-1, len(x))
print(arr.shape)

import time
t1 = time.time()
z, p = MannKendallTrend2D(arr, tails=2, axis=1)
p = p.reshape(145, 192)
t2 = time.time()
print('time =', t2-t1)

The p-value for that sample time series is 0.63341565, which I have validated against the pymannkendall module result. Since arr contains merely duplicated copies of x, the resultant p is a 2d array of size (145, 192), with all 0.63341565.

And it took me only 1.28 seconds to compute that.
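One thing to keep in mind: the broadcasted sign array inside MannKendallTrend2D has shape (m, n, n), which for 27840 records of 50 points is about 70 million elements. If memory becomes a concern with longer series, the rows can be processed in chunks. The sketch below does that for the S statistic only; `mk_s_chunked` and the chunk size are assumptions for illustration, not part of the code above.

```python
import numpy as np

def mk_s_chunked(data, chunk=1000):
    """Mann-Kendall S for each row of a 2D array, computed in row chunks."""
    m, n = data.shape
    upper = np.triu(np.ones((n, n), dtype=int), k=1)   # keep pairs with j > i
    s = np.empty(m, dtype=int)
    for start in range(0, m, chunk):
        block = data[start:start + chunk]
        # signs[r, i, j] = sign(block[r, j] - block[r, i])
        signs = np.sign(block[:, None, :] - block[:, :, None]).astype(int)
        s[start:start + chunk] = (signs * upper).sum(axis=(1, 2))
    return s

data = np.random.rand(2500, 20)
s_all = mk_s_chunked(data, chunk=2500)     # everything in one chunk
s_chunked = mk_s_chunked(data, chunk=300)  # same result, smaller peak memory
print(np.array_equal(s_all, s_chunked))
```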

Jason