I have 16 years of daily meteorological data in NetCDF format. Each daily field is a 501 x 572 grid, so each year has dimensions 365 x 501 x 572. I converted the data into a one-dimensional array and then tried to plot its probability distribution, but because the array is so large, the Python kernel restarts. How can I optimize my code so that the 16 (years) x 365 (days) x 501 (lat) x 572 (lon) data can be flattened into a single array for the distribution plot? I used chunks to optimize the input, but the kernel on my laptop still restarts at the step where I convert the data into a single array. How can I handle this much data with xarray?
import matplotlib.pyplot as plt
import xarray as xr
import numpy as np
import seaborn as sns

fname = '20*.nc'
# open all yearly files lazily as one dataset, backed by dask chunks
ds = xr.open_mfdataset(fname, parallel=True, chunks=100)
# this is where it fails: .values pulls the entire array into memory at once
prec = ds.irwin_cdr.values.flatten()
# displot is figure-level, so it creates its own figure (it takes no ax argument)
sns.displot(prec, bins=50, color="g")
plt.show()
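
One direction I am considering is to compute the histogram lazily with dask, so that only the 50 bin counts ever reach memory instead of the full flattened array. Below is a minimal sketch of that idea; the value range (0, 400) is just a placeholder I would have to adjust to the actual range of irwin_cdr:

import dask.array as da
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

ds = xr.open_mfdataset('20*.nc', parallel=True, chunks=100)
# keep the underlying dask array; ravel() stays lazy, unlike .values.flatten()
flat = ds.irwin_cdr.data.ravel()
# dask needs explicit bin edges or a range to histogram a lazy array;
# values outside the range (including NaN) are not counted
counts, edges = da.histogram(flat, bins=50, range=(0, 400))
counts = counts.compute()  # only the bin counts come back to memory
# plot the distribution as a bar chart of the bin counts
plt.bar(edges[:-1], counts, width=np.diff(edges), align='edge', color='g')
plt.show()

Would something like this be the right approach, or is there a more idiomatic way to do it with xarray?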