
There is a huge and growing abundance of weather data available in cloud buckets. Awesome! However, it is not stored in cloud-optimized formats. I was wondering if there is a way to pull only the metadata from GRIB2 files stored on AWS, and subsequently pull only single points from those files. Same question for NetCDF4. I know the NetCDF4 support libraries let you do this for files on disk, but I have no idea how to do it in the cloud.

I'm at a loss as to which resources I should be looking into in order to explore this. Any help would be really appreciated.

Jon E
  • This is really sad because, e.g., Google advertises BigQuery with satellite data, but it is just stored in Google's file system storage. The data-intensive processing has to be done locally after downloading the files. After searching for a long time, I am sure there is no cloud-based solution for geospatial datasets at the moment. So you have to download the data and use cfgrib/xarray to `grid_data.sel(lat=49.8, lon=9.8, method='nearest')`, or use a public weather API service. – dl.meteo Mar 10 '21 at 06:48
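For reference, a minimal sketch of the download-then-select workflow described in this comment, assuming a placeholder URL and a surface-level field (cfgrib names the coordinates `latitude`/`longitude`):

```python
import urllib.request

import xarray as xr

# Placeholder URL -- substitute any GRIB2 file hosted in a cloud bucket.
url = "https://example.com/gfs.t00z.pgrb2.0p25.f000"
urllib.request.urlretrieve(url, "local.grib2")

# cfgrib builds a small .idx file and reads metadata lazily,
# which is why open_dataset is fast even on big files.
ds = xr.open_dataset(
    "local.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface"}},
)

# Nearest-neighbour extraction of a single point.
point = ds.sel(latitude=49.8, longitude=9.8, method="nearest")
print(point)
```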

1 Answer


You could parse the GRIB2 file on the fly and drop everything you don't need right away. Each GRIB2 file contains one or more GRIB2 messages, each of which has the following structure:

  • Section 0: Indicator Section
  • Section 1: Identification Section
  • Section 2: Local Use Section (optional)
  • Section 3: Grid Definition Section (can be repeated)
  • Section 4: Product Definition Section (can be repeated)
  • Section 5: Data Representation Section (can be repeated)
  • Section 6: Bit-Map Section (can be repeated)
  • Section 7: Data Section (can be repeated)
  • Section 8: End Section

In GRIB2, Section 0 is always 16 bytes and Section 8 is always 4 bytes. Every other section starts with the length of the section (4 bytes) followed by the section number (1 byte). It should therefore be easy to quickly skip over any sections you don't need. You can then read only Section 1, 3 or 5, depending on what metadata you want.
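To illustrate, a minimal sketch of such a section reader in Python, assuming a well-formed edition-2 file on local disk and no error handling:

```python
import struct

def iter_grib2_sections(f):
    """Yield (section_number, raw_bytes) for every section in a GRIB2 file.

    Minimal sketch: assumes a well-formed edition-2 file, no error handling.
    """
    while True:
        indicator = f.read(16)                     # Section 0 is always 16 bytes
        if len(indicator) < 16:
            return                                 # end of file
        assert indicator[:4] == b"GRIB" and indicator[7] == 2
        yield 0, indicator
        while True:
            head = f.read(4)                       # section length, or b"7777"
            if head == b"7777":                    # Section 8 is always 4 bytes
                yield 8, head
                break
            length = struct.unpack(">I", head)[0]  # section length (4 bytes)
            number = f.read(1)[0]                  # section number (1 byte)
            # Read the rest; on a seekable file you could f.seek(length - 5, 1)
            # instead to skip unwanted sections without copying them.
            body = f.read(length - 5)
            yield number, head + bytes([number]) + body

# Keep only the metadata sections (1, 3, 5) and drop everything else:
with open("local.grib2", "rb") as f:
    for number, raw in iter_grib2_sections(f):
        if number in (1, 3, 5):
            print(f"section {number}: {len(raw)} bytes")
```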

There is a drawback, however. If I understand correctly, you want to do this on online resources. In that case you will download the whole file while skipping over some or most of its parts.

If you are trying to build some kind of index of available GRIB data, this would probably be one option: a kind of GRIB crawler.
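A rough sketch of what such a crawler could look like, assuming a placeholder URL and plain HTTP streaming via requests. It records each message's offset and total length (from the last 8 bytes of Section 0) and discards the bytes as they arrive; you could combine it with the section reader above to also capture Section 1 metadata per message:

```python
import struct

import requests

def crawl_grib2(url):
    """Stream a remote GRIB2 file and index its messages without keeping the data.

    Sketch only: everything still travels over the wire -- we just never
    hold Section 7 (the bulk of the file) in memory.
    """
    index = []
    with requests.get(url, stream=True) as resp:
        stream = resp.raw                          # file-like view of the body
        offset = 0
        while True:
            indicator = stream.read(16)            # Section 0 of the next message
            if len(indicator) < 16:
                break
            total = struct.unpack(">Q", indicator[8:16])[0]  # total message length
            index.append({"offset": offset, "length": total})
            remaining = total - 16
            while remaining > 0:                   # drain the rest of the message
                chunk = stream.read(min(remaining, 1 << 16))
                if not chunk:
                    return index
                remaining -= len(chunk)
            offset += total
    return index

print(crawl_grib2("https://example.com/some_file.grib2"))
```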

Note that GRIB1 has a slightly different structure.

More details about GRIB2 sections: https://www.nco.ncep.noaa.gov/pmb/docs/grib2/grib2_doc/

Jan Kubovy
  • So if I understand you correctly, say a GRIB file uses second-order spatial differencing as its compression (which is definitely the most common for NOAA files), then I would HAVE to download the entire Section 7 in order to pull a single point? Edit: can you also point me to resources on how the encoding/decoding algorithms work? I've seen mentions of h1, h2, hmin, and g1, h2, but nothing about how to implement them in a decoding algorithm – Jon E Mar 13 '21 at 01:44
  • Yes, you need to process the whole Section 7, and Section 5, to actually know what the encoding is. Check https://github.com/kubovy/JGribX/blob/kotlin/src/main/java/mt/edu/um/cf2/jgribx/grib2/Grib2RecordDS3.kt for Complex Packing and Spatial Differencing, and section 2.3.2 of https://www.yumpu.com/en/document/view/11723135/guide-to-wmo-table-driven-code-forms, which describes Complex Packing and Spatial Differencing (see the reconstruction sketch after these comments). You will need to download all the data; you can just quickly skip the sections you don't need while reading the download stream. – Jan Kubovy Mar 13 '21 at 07:29
  • I recommend not building a parser for GRIB data on your own. There are many parsers available, e.g. PyNIO, pygrib and cfgrib. What you should know about cfgrib+xarray is that cfgrib loads just a file index, not the whole file, into memory. This is why open_dataset is lightning fast at checking out metadata. – dl.meteo Mar 15 '21 at 09:15
  • @dl.meteo cfgrib produces different idx files from what is available for most NOAA-hosted GRIB data. The available idx files don't contain enough metadata to know the coordinates, so you would still need to pull Sections 1–6 from all the wanted files in order to get the proper metadata, and even then, I don't think cfgrib is smart enough that you can feed it Sections 1–6 without having the full file handy. For locally hosted data I absolutely agree with your approach, but with cloud-hosted data I think we need something else. – Jon E Mar 17 '21 at 19:31
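For what it's worth, a sketch of the reconstruction step asked about above, based on a reading of WMO Data Template 7.3 (complex packing with spatial differencing, order 2). The names h1, h2 and hmin follow the comments; extracting the packed differences from the complex-packing groups is a separate, more involved step not shown here:

```python
def undo_second_order_differencing(packed, h1, h2, hmin):
    """Rebuild the scaled values from second-order spatial differences.

    Recurrence per WMO Data Template 7.3: h1 and h2 are the first two original
    (scaled) values, and hmin is the overall minimum that was subtracted from
    every difference before packing.
    """
    g = [h1, h2]
    for d in packed:                 # packed differences for points 3..N
        g.append(d + hmin + 2 * g[-1] - g[-2])
    return g

# The g values are still scaled integers: the physical value of each point is
# Y = (R + g * 2**E) / 10**D, with reference value R, binary scale factor E
# and decimal scale factor D taken from Section 5.
```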