
I want to attach some metadata to a pandas dataframe. The metadata would be a description of the data processing that was done before the dataframe was saved.

I came across this solution: https://stackoverflow.com/a/52546933/214526

So I tried the following:

xarrDS: xarray.Dataset = pdDF.to_xarray()
xarrDS.attrs["description"] = "Some description about data processing"
# Displaying xarrDS in the notebook at this point shows the data correctly.

xarrDS.to_netcdf(path="processed_df.nc")

But saving to netCDF raises this exception:

ValueError: setting an array element with a sequence

The pandas dataframe does not contain any NaN values. I have not found any relevant solutions online, and I see that this article also saves the dataset using similar code.

Any pointers on how to resolve this, or an alternative way to save the metadata (without using additional MLOps libraries), would be appreciated.

My library versions are as follows:

pandas=1.5.3
xarray=2022.11.0
netcdf4=1.6.3
soumeng78

1 Answer


The likely reason for the error is that your pandas dataframe has some columns of dtype object, for example columns containing strings, and the automatic conversion may fail to map that dtype to one of the supported NetCDF4 data types.

I tested it myself: string columns work without any issue. What causes problems are columns whose cells contain lists or arrays. Here you are out of luck because, as far as I can tell, such cells cannot be mapped to a netCDF variable type when the dataset is written.

import numpy as np
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50.4, 40.2, 45.7],
  "type": ["a", "foo", "bar10"],
  # "arrays": [np.arange(4), np.arange(3), np.arange(2)],
  "lists": [[1, 2], [3, 4], [5, 6]],
}

# Load the data into a DataFrame object:
df = pd.DataFrame(data)
df.dtypes
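With the pandas version from the question, df.dtypes reports object for both the string column and the list column, so the dtype alone does not single out the problem columns. Below is a minimal sketch for finding them, assuming the offending cells are list-like (lists, tuples, NumPy arrays). non_scalar_columns is a hypothetical helper name, not part of pandas; it relies only on pd.api.types.is_list_like, which returns False for plain strings.

```python
import numpy as np
import pandas as pd


def non_scalar_columns(df: pd.DataFrame) -> list:
    """Return the names of object columns that hold list-like cells."""
    found = []
    for name in df.columns:
        col = df[name]
        # Numeric and other non-object columns cannot hold lists or arrays.
        if col.dtype != object:
            continue
        # is_list_like is True for lists, tuples and arrays, False for strings.
        if col.map(pd.api.types.is_list_like).any():
            found.append(name)
    return found


df = pd.DataFrame({
    "calories": [420, 380, 390],
    "type": ["a", "foo", "bar10"],
    "lists": [[1, 2], [3, 4], [5, 6]],
    "arrays": [np.arange(4), np.arange(3), np.arange(2)],
})

print(non_scalar_columns(df))  # ['lists', 'arrays']
```

Columns returned by this helper are the ones to drop or convert before calling to_xarray().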

You can try it without the lists and arrays columns, and it will work. With either of them, you get the error you are seeing:

ds = df.to_xarray()
ds.to_netcdf("test.nc")

The error occurs whether or not you set an attribute on the dataset.
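If such a column has to be kept, one possible workaround is to store each cell as a JSON string before converting, and decode it again after loading. This is only a sketch under assumptions: encode_list_columns, decode_list_columns and save_with_description are hypothetical helpers, the cells are assumed to be JSON-serializable, and plain string columns are assumed to save without trouble, as reported above.

```python
import json

import pandas as pd


def _to_plain(cell):
    # NumPy arrays expose tolist(), which yields plain Python numbers.
    return cell.tolist() if hasattr(cell, "tolist") else list(cell)


def encode_list_columns(df, columns):
    """Return a copy of df with the given columns stored as JSON strings."""
    out = df.copy()
    for name in columns:
        out[name] = out[name].map(lambda cell: json.dumps(_to_plain(cell)))
    return out


def decode_list_columns(df, columns):
    """Inverse of encode_list_columns: parse the JSON strings back into lists."""
    out = df.copy()
    for name in columns:
        out[name] = out[name].map(json.loads)
    return out


def save_with_description(df, path, description):
    """Write df to netCDF with a description attribute (needs xarray and a netCDF backend)."""
    ds = df.to_xarray()
    ds.attrs["description"] = description
    ds.to_netcdf(path)


df = pd.DataFrame({"calories": [420, 380, 390], "lists": [[1, 2], [3, 4], [5, 6]]})
encoded = encode_list_columns(df, ["lists"])
restored = decode_list_columns(encoded, ["lists"])
```

Calling save_with_description(encoded, "test.nc", "Some description about data processing") would then write the file. That call is left out of the block because it depends on the installed xarray and netcdf4 versions, and whether very long strings save cleanly may depend on them as well.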

JonasV
  • Thank you. Yes, it does have a few columns with strings. I'll check whether the save succeeds without those columns and update. – soumeng78 Apr 27 '23 at 15:42
  • I added some more information to the answer. It is likely lists or arrays stored in a column, which the netCDF specification does not support. – JonasV Apr 28 '23 at 07:42
  • After dropping all the non-numeric columns (i.e. the str type columns in my dataset), it worked fine. – soumeng78 Apr 29 '23 at 00:50
  • Now, is there a way to write str type data, so that I can preserve the original dataset after just some cleaning? – soumeng78 Apr 29 '23 at 00:52
  • In my test, I was able to save strings. One kind of object I wasn't able to save was lists. Maybe check again whether you have any columns with lists and remove them. If that doesn't work, you might want to try updating xarray and/or netcdf4. – JonasV May 02 '23 at 06:52
  • One of the columns that shows up as object holds very large strings, maybe ~3000 characters long. I'll experiment further with what happens if I explicitly set the column's type to 'string'. – soumeng78 May 03 '23 at 21:19
  • I could not make it work with that specific column. Dropping that column while keeping all the other str type columns did work, though. – soumeng78 May 04 '23 at 19:33
  • Alright, at least you figured out what the issue was then. – JonasV May 05 '23 at 06:45