I came across this simple problem but haven't found my way around it. I have two datasets (DS_clim and DS_yield) that I would like to compare across three dimensions (time, lat, lon). However, their valid data points are not exactly the same, so I thought of using xr.DataArray.where to mask/crop both of them so that they have exactly the same dimensions. Oddly, the outputs are still not compatible: DS_yield has more data points than DS_clim. If anyone could help me make them identical in terms of dimensions, I would really appreciate it. I uploaded both .nc files, and below is a self-contained piece of code that should reproduce the problem.

Cheers!

Link for downloading the two files: https://drive.google.com/file/d/1gDSoKOY6eFLHqZ4AM0TTr4tXEBu3Y6yM/view?usp=sharing https://drive.google.com/file/d/1ysLqxNz-FBykJS2KojAx0UgTy6Hd9Wc2/view?usp=sharing

import xarray as xr
import pandas as pd

DS_clim = xr.open_dataset('ds_clim.nc')
DS_yield = xr.open_dataset('ds_yield.nc')

DS_clim = DS_clim.where(DS_yield['Yield'] >= 0.0 ) # Remove any grid points not present in the DS_yield
DS_yield = DS_yield.where(DS_clim['mask'] == 1.0 ) # Remove any grid points not present in the DS_clim


df_clim = DS_clim.to_dataframe().dropna()
df_yield = DS_yield.to_dataframe().dropna()

if len( df_clim) == len(df_yield):
    print('Dimensions are equal')
else:
    print('Dimensions are not equal')

EDIT: The DS_yield and DS_clim outputs are:

DS_clim
Out[180]: 
<xarray.Dataset>
Dimensions:  (lat: 70, lon: 84, time: 36)
Coordinates:
  * lat      (lat) float64 -34.75 -34.25 -33.75 -33.25 ... -1.25 -0.75 -0.25
  * lon      (lon) float64 -75.75 -75.25 -74.75 -74.25 ... -35.25 -34.75 -34.25
  * time     (time) int64 1981 1982 1983 1984 1985 ... 2012 2013 2014 2015 2016
Data variables:
    DTR      (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    ETR      (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    PRCPTOT  (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    R10mm    (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    R20mm    (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    Rx1day   (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    Rx5day   (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    SU       (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    TN10p    (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    TN90p    (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    TNn      (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    TNx      (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    TR       (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    TX10p    (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    TX90p    (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    TXn      (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    TXx      (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan
    mask     (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan

DS_yield
Out[181]: 
<xarray.Dataset>
Dimensions:  (lat: 70, lon: 84, time: 36)
Coordinates:
  * lat      (lat) float64 -34.75 -34.25 -33.75 -33.25 ... -1.25 -0.75 -0.25
  * lon      (lon) float64 -75.75 -75.25 -74.75 -74.25 ... -35.25 -34.75 -34.25
  * time     (time) int32 1981 1982 1983 1984 1985 ... 2012 2013 2014 2015 2016
Data variables:
    Yield    (time, lat, lon) float64 nan nan nan nan nan ... nan nan nan nan

and the outputs of the two dataframes are:

df_clim
Out[183]: 
                          DTR        ETR  ...        TXx  mask
lat    lon    time                        ...                 
-33.25 -53.25 1981  10.103154  22.591901  ...  34.204458   1.0
              1982  10.723433  23.566873  ...  34.711117   1.0
              1983   9.179805  20.937945  ...  34.776137   1.0
              1984   9.174395  21.326026  ...  34.636377   1.0
              1985  11.931539  23.326610  ...  35.499480   1.0
                      ...        ...  ...        ...   ...
-22.75 -51.25 2012  11.331343  18.377294  ...  33.616045   1.0
              2013  10.325607  17.657545  ...  32.605069   1.0
              2014  10.945801  18.699326  ...  35.043913   1.0
              2015  10.226426  16.594986  ...  33.634570   1.0
              2016   9.322276  16.398513  ...  33.411968   1.0

[5853 rows x 18 columns]

df_yield
Out[184]: 
                       Yield
lat    lon    time          
-33.25 -53.25 1981  1.687200
              1982  1.669250
              1983  1.532300
              1984  1.133350
              1985  1.215400
                     ...
-22.75 -51.25 2012  2.369826
              2013  2.773502
              2014  1.373870
              2015  2.901679
              2016  2.220938

[5875 rows x 1 columns]

As you can see, the number of rows in each dataframe is different, meaning they are not exactly identical.
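
One plausible cause, which cannot be confirmed without the files, is that `dropna()` removes a row when any column is NaN. df_clim has 18 columns and df_yield has one, so a grid point where `mask` is 1.0 but some other climate variable is missing would survive in df_yield only. A minimal sketch of that effect, using made-up values on a toy (lat, lon, time) index rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy index standing in for the real (lat, lon, time) grid.
idx = pd.MultiIndex.from_product(
    [[-33.25], [-53.25], [1981, 1982, 1983]],
    names=["lat", "lon", "time"],
)

# Toy climate frame: 'mask' is 1.0 everywhere, but 'DTR' is missing in 1982.
df_clim = pd.DataFrame(
    {"DTR": [10.1, np.nan, 9.2], "mask": [1.0, 1.0, 1.0]}, index=idx
)
# Toy yield frame: a single column with no missing values.
df_yield = pd.DataFrame({"Yield": [1.69, 1.67, 1.53]}, index=idx)

# dropna() defaults to how="any": a row is dropped if ANY column is NaN.
n_clim = len(df_clim.dropna())
n_yield = len(df_yield.dropna())
print(n_clim, n_yield)

# Rows that survive in the yield frame but not in the climate frame.
only_in_yield = df_yield.dropna().index.difference(df_clim.dropna().index)
print(only_in_yield.tolist())
```

Running the same `difference` call on the real frames would list the 22 index entries that appear only in df_yield.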

Henrique
  • I think [`.where()`](https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.DataFrame.where.html) is not doing what you think it’s doing. What are you trying to achieve eventually? Also can you provide a few lines / columns of each dataframe [that reproduce the issue](https://stackoverflow.com/help/minimal-reproducible-example)? (see also [this page specifically for pandas](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)) – Cimbali Jul 07 '21 at 17:12
  • The idea is to use 'xr.where' as a mask to cut out the grid cells (in lat,lon,time) that are NA. My objective is to make the two datasets have the exact same dimensions. – Henrique Jul 08 '21 at 08:06
  • Also, I am not really sure how to create the dataframe to replicate the problem because I do not know where the problem is. However, the two links I shared are working. Thank you! – Henrique Jul 08 '21 at 08:13
  • I don’t think anyone is going to download unknown files from an unknown google drive. Show us maybe the output of `DS_clim.head()` and `DS_yield.head()` (or `.sample(10)`) in such a way that you have points from each dataframe that are not in the other one. For example it’s not clear what your `Yield` and `mask` columns are, or where the time / latitude / longitude come into play. Then if you want to be as clear as possible you can also show what the output is that you want. – Cimbali Jul 08 '21 at 09:56
  • Yes, I agree. Sorry about this. I'll update the main question with the DS_yield and DS_clim. Would it help if I share the files on Github? The output that I want is to have the print command showing the dimensions are equal (last line), where lat, lon, time are exactly the same after the 'dropna()' command. – Henrique Jul 08 '21 at 13:05

1 Answer

You could simply use the intersection of indexes:

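# Drop rows with missing values first; otherwise both frames keep the full
# (lat, lon, time) grid and the intersection changes nothing.
df_clim = DS_clim.to_dataframe().dropna()
df_yield = DS_yield.to_dataframe().dropna()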

common_idx = df_clim.index.intersection(df_yield.index)
df_clim = df_clim.loc[common_idx]
df_yield = df_yield.loc[common_idx]
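
A minimal sketch of the index intersection on toy data, assuming the NaN rows are dropped before intersecting. The frames and values below are made up for illustration and are not taken from the asker's files. Without the `dropna()` step, two frames built on the same (lat, lon, time) grid already share every index entry, so the intersection would leave them unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy grid: two latitudes, one longitude, two years (4 rows).
idx = pd.MultiIndex.from_product(
    [[-33.25, -32.75], [-53.25], [1981, 1982]],
    names=["lat", "lon", "time"],
)

# Each frame is missing a value at a different grid point.
df_clim = pd.DataFrame(
    {"DTR": [10.1, np.nan, 9.2, 9.1], "mask": [1.0, 1.0, 1.0, 1.0]},
    index=idx,
)
df_yield = pd.DataFrame({"Yield": [1.69, 1.67, np.nan, 1.13]}, index=idx)

# Drop incomplete rows, then keep only index entries present in both frames.
df_clim = df_clim.dropna()
df_yield = df_yield.dropna()

common_idx = df_clim.index.intersection(df_yield.index)
df_clim = df_clim.loc[common_idx]
df_yield = df_yield.loc[common_idx]

print(len(df_clim), len(df_yield))
print(df_clim.index.equals(df_yield.index))
```

After this step both frames cover the same (lat, lon, time) entries, so the length check from the question should report equal dimensions.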
Cimbali