
I have a NetCDF4 file that doesn't grow beyond 2 GB.

I am converting over 200 text files into a single NetCDF4 file. The sample data looks like this:

STATIONS_ID;MESS_DATUM;  QN;FF_10;DD_10;eor
       3660;201912150000;    3;   4.6; 170;eor
       3660;201912150010;    3;   4.2; 180;eor
       3660;201912150020;    3;   4.3; 190;eor
       3660;201912150030;    3;   5.2; 190;eor
       3660;201912150040;    3;   5.1; 190;eor
       3660;201912150050;    3;   4.8; 190;eor

The code looks like:

import os

import netCDF4
import numpy as np
import pandas as pd

files = [f for f in os.listdir('.') if os.path.isfile(f)]
count = 0
for f in files:

    filecp = open(f, "r", encoding="ISO-8859-1")

    # NC file setup
    mydata = netCDF4.Dataset('v5.nc', 'w', format='NETCDF4')

    mydata.description = 'Measurement Data'
    
    mydata.createDimension('STATION_ID',None)
    mydata.createDimension('MESS_DATUM',None)
    mydata.createDimension('QN',None)
    mydata.createDimension('FF_10',None)
    mydata.createDimension('DD_10',None)
    
    STATION_ID = mydata.createVariable('STATION_ID',np.short,('STATION_ID'))
    # int64: timestamps like 201912150000 do not fit in 32-bit integers
    MESS_DATUM = mydata.createVariable('MESS_DATUM',np.int64,('MESS_DATUM'))
    QN = mydata.createVariable('QN',np.byte,('QN'))
    FF_10 = mydata.createVariable('FF_10',np.float64,('FF_10'))
    DD_10 = mydata.createVariable('DD_10',np.short,('DD_10'))
    
    STATION_ID.units = ''
    MESS_DATUM.units = 'Central European Time yyyymmddhhmi'
    QN.units = ''
    FF_10.units = 'meters per second'
    DD_10.units = "degree"
    
    txtdata = pd.read_csv(filecp, delimiter=';').values
    
    #txtdata = np.genfromtxt(filecp, dtype=None, delimiter=';', names=True, encoding=None)
    if len(txtdata) > 0:
        
        df = pd.DataFrame(txtdata)

        sh = txtdata.shape
        print("txtdata shape is ", sh)
    
        mydata['STATION_ID'][:] = df[0]
        mydata['MESS_DATUM'][:] = df[1]
        mydata['QN'][:] = df[2]
        mydata['FF_10'][:] = df[3]
        mydata['DD_10'][:] = df[4]
    
        
    mydata.close()
    filecp.close()
    count +=1
  • Are you using 32 bit python? – talonmies Jun 21 '21 at 05:12
  • How do I check this? @talonmies – super Jun 21 '21 at 05:16
  • https://stackoverflow.com/q/1405913/681865 – talonmies Jun 21 '21 at 05:19
  • @talonmies I am using macOS :( – super Jun 21 '21 at 05:22
  • python -c "import ctypes; print(32 if ctypes.sizeof(ctypes.c_voidp)==4 else 64, 'bit CPU')" prints 64 bit CPU – super Jun 21 '21 at 05:26
  • @talonmies I ran `import sys; from math import log; log(sys.maxsize, 2)` in a Jupyter notebook and the output was 63.0 (reference: https://asmeurersympy.wordpress.com/2009/11/13/how-to-get-both-32-bit/) – super Jun 21 '21 at 05:29
  • Hi, you load all the data into memory with pandas; did you check that the limit is not there? You can use pandas IO with chunks as well, i.e. there is no need to read the full file into memory: https://pandas.pydata.org/pandas-docs/dev/user_guide/io.html#io-chunking (a minimal sketch of chunked reading follows these comments). If you provide a working example with a CSV file generator, it might be easier to debug; one file would be enough. – kakk11 Jun 21 '21 at 06:58
  • @kakk11 I will try this. I have one more doubt: what about the **dimensions**, is it OK for them to be **unlimited (None)**, or is there a problem with that? – super Jun 21 '21 at 07:13
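
Below is a minimal sketch of the chunked reading kakk11 suggests above; the file name and the chunk size of 50,000 rows are made-up examples, not values from the question.

import pandas as pd

# "wind_3660.txt" is a stand-in for one of the DWD text files
for chunk in pd.read_csv("wind_3660.txt", delimiter=";",
                         encoding="ISO-8859-1", chunksize=50_000):
    # each chunk is a DataFrame with at most 50,000 rows of the file
    print(chunk.shape)

Each chunk could then be written to the NetCDF file the same way a whole file is written in the answer below.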

1 Answer


Your problem is that you recreate the same file on every iteration of the loop, so the file never contains more than one input file's worth of data.

Open the file once, and append each new batch of data to the end of the NetCDF variables.

If the first file gives you 124 values, you write:

mydata['STATION_ID'][0:124] = df[0]

and if the second file gives you 224 values, you write:

mydata['STATION_ID'][124:124+224] = df[0]
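
Here is a minimal, self-contained sketch of this appending pattern; the file name and the values are toy examples, not the DWD data.

import numpy as np
import netCDF4

nc = netCDF4.Dataset("append_demo.nc", "w", format="NETCDF4")
nc.createDimension("FF_10", None)                       # unlimited dimension
ff = nc.createVariable("FF_10", np.float64, ("FF_10",))

start = 0
for batch in (np.array([4.6, 4.2, 4.3]), np.array([5.2, 5.1])):
    ff[start:start + len(batch)] = batch                # write at the current end
    start += len(batch)                                 # advance the write position

print(len(ff))   # 5 -- the variable grew with each batch
nc.close()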

So, assuming the data files have been downloaded from https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/10_minutes/wind/recent/ to <text file path>:

import os

import netCDF4
import numpy as np
import pandas as pd


# create the output file once, outside the loop
mydata = netCDF4.Dataset('v5.nc', 'w', format='NETCDF4')
mydata.description = 'Wind Measurement Data'

# one unlimited dimension per variable, so each variable can grow as data is appended
mydata.createDimension('STATION_ID', None)
mydata.createDimension('MESS_DATUM', None)
mydata.createDimension('QN', None)
mydata.createDimension('FF_10', None)
mydata.createDimension('DD_10', None)

STATION_ID = mydata.createVariable('STATION_ID', np.short, ('STATION_ID',))
# int64: timestamps like 201912150000 do not fit in 32-bit integers
MESS_DATUM = mydata.createVariable('MESS_DATUM', np.int64, ('MESS_DATUM',))
QN = mydata.createVariable('QN', np.byte, ('QN',))
FF_10 = mydata.createVariable('FF_10', np.float64, ('FF_10',))
DD_10 = mydata.createVariable('DD_10', np.short, ('DD_10',))

STATION_ID.units = ''
MESS_DATUM.units = 'Central European Time yyyymmddhhmi'
QN.units = ''
FF_10.units = 'meters per second'
DD_10.units = 'degree'

fpath = <text file path>
files = [f for f in os.listdir(fpath)]
mydata_startindex = 0
for f in files:
    with open(fpath + f, "r", encoding="ISO-8859-1") as filecp:
        txtdata = pd.read_csv(filecp, delimiter=';')
    chunksize = len(txtdata)
    if chunksize > 0:
        # append this file's rows after everything written so far
        mydata['STATION_ID'][mydata_startindex:mydata_startindex+chunksize] = txtdata['STATIONS_ID']
        mydata['MESS_DATUM'][mydata_startindex:mydata_startindex+chunksize] = txtdata['MESS_DATUM']
        # the '  QN' column name keeps the leading spaces from the header line
        mydata['QN'][mydata_startindex:mydata_startindex+chunksize] = txtdata['  QN']
        mydata['FF_10'][mydata_startindex:mydata_startindex+chunksize] = txtdata['FF_10']
        mydata['DD_10'][mydata_startindex:mydata_startindex+chunksize] = txtdata['DD_10']
        mydata_startindex += chunksize

mydata.close()
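
As a quick sanity check (a sketch, not part of the original answer), you can reopen the finished file and confirm that records from every input file were appended:

import os
import netCDF4

# reopen read-only and inspect the total record count and the size on disk
with netCDF4.Dataset('v5.nc', 'r') as check:
    print(check['STATION_ID'].shape)                # total number of appended records
print(round(os.path.getsize('v5.nc') / 2**30, 2), 'GiB on disk')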
  • To me, it seems easier if the OP would concatenate the DataFrames created from the text files and *then* put the columns into the nc dataset in one final step (see the sketch after these comments). – FObersteiner Jun 21 '21 at 09:56
  • @MrFuppes That would work only as long as all the dataframes fit in memory, but it might be simpler indeed. As NetCDF4 has good support for chunking and appending, I think a partial write is not a bad exercise. For production code, I would use dask and xarray anyway. – kakk11 Jun 21 '21 at 10:01
  • @kakk11 The chunksize is causing an issue on the second iteration; the first iteration with one file runs fine. What is the issue? Could you help? Also, on what basis do we decide on the chunksize? – super Jun 21 '21 at 12:23
  • @MrFuppes : please help me with the above query if you could – super Jun 21 '21 at 12:23
  • UPDATE: it works fine with the chunksize for any number of iterations. QUERY: what parameters does the chunksize depend on, and how does it vary in practice? – super Jun 21 '21 at 12:49
  • There is some confusion in terminology: chunksize as used in my example is just the portion of data read from each new file, and its size is computed from each new CSV file. This is different from NetCDF's internal `chunksize`, which is often used for IO optimization. I doubt it is needed here, though it might make writing a little faster. – kakk11 Jun 21 '21 at 17:15
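
For reference, here is a rough sketch of the concatenate-first approach FObersteiner describes above. As kakk11 notes, it only works while all the DataFrames fit in memory. The dimension name 'obs' and the output file name 'v5_concat.nc' are illustrative choices, and only one column is shown; the others follow the same pattern.

import os

import netCDF4
import numpy as np
import pandas as pd

fpath = "<text file path>"   # same placeholder directory as in the answer
frames = [pd.read_csv(os.path.join(fpath, f), delimiter=';', encoding='ISO-8859-1')
          for f in os.listdir(fpath)]
alldata = pd.concat(frames, ignore_index=True)   # one big DataFrame with all rows

with netCDF4.Dataset('v5_concat.nc', 'w', format='NETCDF4') as nc:
    nc.createDimension('obs', None)
    ff = nc.createVariable('FF_10', np.float64, ('obs',))
    ff.units = 'meters per second'
    ff[0:len(alldata)] = alldata['FF_10'].values   # single write instead of one per file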