Correlation of gridded time series avoiding NANs

Question

I am looking for a way to compute the correlation of two gridded time series. Both have the same shape, (432, 55, 144), which is (time steps, latitude, longitude). As you can see in the following picture, I already managed to get a two-dimensional array with all the correlation coefficients using:

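import numpy as np

if data1.shape == data2.shape:
    # one slot per (latitude, longitude) grid cell
    corrcoefMatrix = [[0 for i in range(len(longitudes))] for j in range(len(latitudes))]
    for x in range(len(latitudes)):
        for y in range(len(longitudes)):
            # 2x2 correlation matrix; [0, 1] is the data1/data2 coefficient
            corrvalue = np.corrcoef(data1[:, x, y], data2[:, x, y])
            corrcoefMatrix[x][y] = corrvalue[0, 1]

    # convert to an array once, after both loops have finished
    corrcoefMatrix = np.squeeze(np.asarray(corrcoefMatrix))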

However, there are some NANs, which cause the white missing-value spots. Even if only one value in the 432-step time series is missing, the correlation coefficient is NAN. According to this post, pandas seems to be the best choice. However, it only accepts two-dimensional arrays, so I transformed my data using Jarad's answer from this post:

df1 = pd.DataFrame([list(l) for l in data1]).stack().apply(pd.Series).reset_index(0,drop=True)
df2 = pd.DataFrame([list(l) for l in data2]).stack().apply(pd.Series).reset_index(0,drop=True)

and then called df1.corrwith(df2). This gave me only a one-dimensional array of length 144, not the 55x144 array I want. There must be a fairly simple way, since correlations with missing values are computed quite often, but it is not well documented or I just cannot find it.
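One possible way to get the full 55x144 result from pandas is to flatten the two spatial axes so that each grid cell becomes one column, call `corrwith`, and reshape the result. The following is only a sketch: the function name `corr_grid_pandas` is hypothetical, and it assumes missing values are stored as NAN in plain float arrays (not as a masked array).

```python
import numpy as np
import pandas as pd

def corr_grid_pandas(data1, data2):
    """Per-grid-cell Pearson correlation of two (time, lat, lon) arrays.

    Hypothetical helper; assumes missing values are encoded as NaN.
    """
    if data1.shape != data2.shape:
        raise ValueError("data1 and data2 must have the same shape")
    ntime, nlat, nlon = data1.shape
    # Flatten (lat, lon) so every grid cell is one DataFrame column.
    df1 = pd.DataFrame(np.asarray(data1, dtype=float).reshape(ntime, nlat * nlon))
    df2 = pd.DataFrame(np.asarray(data2, dtype=float).reshape(ntime, nlat * nlon))
    # corrwith pairs up equally labelled columns and, for Pearson,
    # leaves out time steps where either series is NaN.
    corr = df1.corrwith(df2)
    # Restore the (lat, lon) layout.
    return corr.to_numpy().reshape(nlat, nlon)
```

Called as `corr_grid_pandas(data1, data2)`, this should return an array of shape (55, 144) for the data described above.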

Be more specific... what's the question exactly? Otherwise you'll just get pointed to [something like this](https://pandas.pydata.org/pandas-docs/stable/missing_data.html) — John Mee, Jun 02 '17 at 04:59
How do I produce a correlation coefficient matrix without using the NAN values in the calculation, to avoid blank cells like those in the map shown above, e.g. northern Africa? — David Hoffmann, Jun 02 '17 at 05:45
What are you going to fill the holes with? They appear to correspond to areas of the planet that, more than likely, have no data. — John Mee, Jun 02 '17 at 06:01
Each grid cell has its own time series. It's just that some of them contain a few NANs. For example, the grid cell at latitude index 40 and longitude index 90 (`data1[:,40,90]`) has 432 values of one variable. The corresponding grid cell in data2 is the same. However, if one of the arrays has a single NAN, the whole correlation is NAN. So there is enough data to perform the correlation. The calculation only needs to skip the NANs. — David Hoffmann, Jun 02 '17 at 08:07
Did you read through ["Working with missing data"](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)? — John Mee, Jun 03 '17 at 05:30
There are a lot of options listed in there. I haven't done it myself so am not going to be much help. Perhaps someone who does have specific experience with this will chime in. That's more likely if you can rewrite/restart the question in the terms of the documentation... "the docs say... which I've done like this... but I'm struggling with... because I'm trying to achieve..." You might've moved on by now, but either way, Good luck! — John Mee, Jun 03 '17 at 05:36
@JohnMee Yes, I've read through this. I overcame my problem by using a masked array, which leaves invalid data like NAN out of the calculation: `np.ma.corrcoef(array1[:,x,y],array2[:,x,y])`. But the documentation about "Working with missing data" was helpful indeed. Thanks for your help. — David Hoffmann, Jun 06 '17 at 03:24
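The masked-array idea from the last comment can also be written without the double loop. The sketch below is not the code from the comment: it is a vectorised variant that masks NANs with `np.ma.masked_invalid`, applies a joint mask so that a time step is skipped whenever either series is missing, and then evaluates the Pearson formula along the time axis. The function name `corr_grid_masked` is hypothetical.

```python
import numpy as np

def corr_grid_masked(data1, data2):
    """Per-grid-cell Pearson correlation, skipping missing time steps.

    Hypothetical helper; inputs are (time, lat, lon) arrays in which
    missing values are NaN or already masked.
    """
    if data1.shape != data2.shape:
        raise ValueError("data1 and data2 must have the same shape")
    a = np.ma.masked_invalid(data1)
    b = np.ma.masked_invalid(data2)
    # A time step counts only if it is valid in BOTH series.
    joint = np.ma.getmaskarray(a) | np.ma.getmaskarray(b)
    a = np.ma.array(a, mask=joint)
    b = np.ma.array(b, mask=joint)
    # Deviations from the per-cell mean over the valid time steps.
    da = a - a.mean(axis=0)
    db = b - b.mean(axis=0)
    num = (da * db).sum(axis=0)
    den = np.ma.sqrt((da ** 2).sum(axis=0) * (db ** 2).sum(axis=0))
    # Cells without usable data (or with zero variance) come back as NaN.
    return (num / den).filled(np.nan)
```

Cells where every time step is missing in at least one of the two series stay NAN, so genuinely empty regions remain blank on the map while cells with only a few gaps get a coefficient.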
