Storing string datasets in hdf5 with unicode

Question

I am trying to store variable-length strings read from a file that contains special characters such as ø, æ, and å. Here is my code:

import h5py as h5
file = h5.File('deleteme.hdf5','a')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(1,),dtype=dt)
dset.attrs[str(1)] = "some text with ø, æ, å"

However, the text does not appear to be stored properly. The stored data shows up as:

"some text with \37777777703\37777777670, \37777777703\37777777646,\37777777703\37777777645"

How can I store the special characters properly? I have tried to follow the guide provided in the documentation here: Strings in HDF5 - Variable-length UTF-8

Edit:

The output above is from h5dump. The answer below verified that the characters are stored correctly as UTF-8.

The characters look fine when read with Python 3 `h5py`. I do see your codes with `h5dump`. – hpaulj, Jun 20 '17 at 20:15
`h5dump` also shows that the `DATATYPE` of that string is `CSET H5T_CSET_UTF8;` – hpaulj, Jun 20 '17 at 20:23

score 2 · Accepted Answer · edited Jun 21 '17 at 15:47

With:

import numpy as np
import h5py as h5
file = h5.File('deleteme.hdf5','w')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(3,),dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()

file = h5.File('deleteme.hdf5','r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()

I see:

$ python3 stack44661467.py 
['ø' 'æ' 'å']
some text with ø, æ, å

That is, h5py treats the strings as Unicode both when writing and when reading.

With the dump utility:

$ h5dump deleteme.hdf5 
HDF5 "deleteme.hdf5" {
GROUP "/" {
   DATASET "text" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
      (0): "\37777777703\37777777670", "\37777777703\37777777646",
      (2): "\37777777703\37777777645"
      }
      ATTRIBUTE "1" {
         DATATYPE  H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
         }
      }
   }
}
}

Note that in both cases (the dataset and the attribute) the datatype is marked as UTF-8:

     DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }

That's what the docs say:

http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8

They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.

Let h5py (or another reader) worry about interpreting \37777777703\37777777670 as the proper Unicode character.
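
The long octal escapes in the h5dump output can be checked by hand. A plausible reading, which is an assumption inferred from the values rather than something stated in this thread, is that h5dump prints each non-ASCII byte as a sign-extended number in octal, so only the low 8 bits matter. The helper below (`decode_h5dump_escapes` is a hypothetical name, not part of h5py or the HDF5 tools) keeps those low bits and decodes the result as UTF-8:

```python
import re

# Hypothetical helper, not part of h5py or the HDF5 tools.
# Assumption: each "\NNN..." escape is one byte printed as a
# sign-extended octal number, so only its low 8 bits are meaningful.
ESCAPE = re.compile(r"\\([0-7]+)")

def decode_h5dump_escapes(text):
    raw = bytearray()
    pos = 0
    for match in ESCAPE.finditer(text):
        # Copy the literal text before the escape unchanged.
        raw += text[pos:match.start()].encode("utf-8")
        # Keep only the low byte of the octal value.
        raw.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    # Copy whatever follows the last escape.
    raw += text[pos:].encode("utf-8")
    return raw.decode("utf-8")

dumped = r"some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
print(decode_h5dump_escapes(dumped))
```

Under that assumption the escapes reduce to the bytes C3 B8, C3 A6 and C3 A5, which are the UTF-8 encodings of ø, æ and å. The sketch is greedy about octal digits, so it would misread an escape that is immediately followed by a literal digit.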

cosinepenguin · Answer 2 · 2017-06-20T21:58:14.843

1

You should try storing your data in UTF-8 format by doing the following:

To encode in UTF-8 (before storing with h5py) do:

u"æ".encode("utf-8")

which in Python 3 returns:

b'\xc3\xa6'

Then to decode, call decode on the bytes like this:

b'\xc3\xa6'.decode("utf-8")

which returns:

æ
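
A minimal sketch of that round trip in Python 3, where `encode` returns a `bytes` object and `decode` has to be called on bytes rather than on `str`. The variable names are illustrative only:

```python
text = "æ"

# str -> bytes: one character becomes two UTF-8 bytes.
encoded = text.encode("utf-8")
print(encoded)        # b'\xc3\xa6'
print(len(text), len(encoded))

# bytes -> str: decoding with the same codec restores the original.
decoded = encoded.decode("utf-8")
print(decoded == text)
```

As the comments on this answer point out, h5py already handles `str` values as UTF-8, and a manually encoded value was reported to show up in h5dump as `CSET H5T_CSET_ASCII;`.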

Hope it helps!

EDIT

When you open a text file that should be read as UTF-8, you can pass the encoding parameter to the built-in open function:

f = open(fname, encoding="utf-8")

This should help decode the original file correctly.
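
As a sketch of why the `encoding` argument matters, the snippet below writes a small UTF-8 file to a temporary directory and reads it back twice. The file name and the Latin-1 comparison are illustrative assumptions, not taken from the question:

```python
import os
import tempfile

original = "some text with ø, æ, å"

with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "input.txt")  # hypothetical file name
    with open(fname, "w", encoding="utf-8") as f:
        f.write(original)

    # Reading with the matching encoding returns the original text.
    with open(fname, encoding="utf-8") as f:
        as_utf8 = f.read()

    # Reading the same bytes as Latin-1 turns each two-byte character
    # into two unrelated characters.
    with open(fname, encoding="latin-1") as f:
        as_latin1 = f.read()

print(as_utf8 == original)
print(len(as_utf8), len(as_latin1))
```

A string read this way is an ordinary `str`, so it can be handed to h5py as-is without inspecting individual characters.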

Source: python-notes

edited Jun 20 '17 at 21:58

answered Jun 20 '17 at 19:33

cosinepenguin


I am reading the text from a file which contains these characters, and thereupon storing the text. Using your method, I would have to either alter the file itself, or do this on the fly by checking every character that is read. – imranal Jun 20 '17 at 19:42
When I use this `encode` `h5dump` shows the same string, but marks it as `CSET H5T_CSET_ASCII;` – hpaulj Jun 20 '17 at 20:24
Hmm. Sorry you're absolutely right I didn't reread the question after it was edited. I'll do some more searching and try to find something, but I think your solution will have something to do with encoding to utf-8! – cosinepenguin Jun 20 '17 at 21:54
I added a postscriptum to the answer. Does this work? I'm curious to see how to solve this issue. – cosinepenguin Jun 20 '17 at 21:59
The answer chosen verified that the text was being stored as utf-8. I updated my question to verify that I used h5dump to produce the output. – imranal Jun 21 '17 at 12:06
