I am working on ERA5 reanalysis data in NetCDF format, and I need to compute wind direction based on U and V components.
I already have a working piece of Python code to do so using xarray and pandas, and saving output as NetCDF.
While I do not add any additional data (only applying transformation to compute a new variable instead of U and V), I'm getting a huge increase in the size of the ouput file. Input files are about 30 Mo and the output one is about 300 Mo.
Can someone explain what is going on ? There must be something with the way my output file is being shaped, but I don't get how. Is it due to the encoding of the file, datatype used (float32 in both input and output here), or any other NetCDF format issue ?
Also, do you have any idea how I could optimize the size of the output file ?
To help you understand the differences, here is a summary of the input file :
<xarray.Dataset>
Dimensions: (latitude: 7, longitude: 8, time: 113952)
Coordinates:
number int64 ...
* time (time) datetime64[ns] 1979-01-01 ... 1991-12-31T23:00:00
step timedelta64[ns] ...
surface float64 ...
* latitude (latitude) float64 47.5 47.25 47.0 46.75 46.5 46.25 46.0
* longitude (longitude) float64 -3.0 -2.75 -2.5 -2.25 -2.0 -1.75 -1.5 -1.25
valid_time (time) datetime64[ns] ...
Data variables:
u10 (time, latitude, longitude) float32 ...
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2022-05-04T13:10 GRIB to CDM+CF via cfgrib-0.9.9...
And the output one :
<xarray.Dataset>
Dimensions: (latitude: 7, longitude: 8, time: 113952)
Coordinates:
* time (time) datetime64[ns] 1979-01-01 ... 1991-12-31T23:00:00
* latitude (latitude) float64 46.0 46.25 46.5 46.75 47.0 47.25 47.5
* longitude (longitude) float64 -3.0 -2.75 -2.5 -2.25 -2.0 -1.75 -1.5 -1.25
Data variables:
number (time, latitude, longitude) int64 0 0 0 0 0 0 0 ... 0 0 0 0 0 0
step (time, latitude, longitude) timedelta64[ns] 00:00:00 ... 00:0...
surface (time, latitude, longitude) float64 0.0 0.0 0.0 ... 0.0 0.0 0.0
valid_time (time, latitude, longitude) datetime64[ns] 1979-01-01 ... 199...
direction (time, latitude, longitude) float32 17.76 26.89 ... 180.1 178.5
Only difference I can see is that some of the original coordinates are now data variables.