
I am referring to this question. I am starting this new thread because I did not fully understand the answer given there, and I hope someone can explain it to me in more detail.

Basically my problem is the same as in that link. Previously, I used np.vstack and created an HDF5 file from the result. Below is my example:

import numpy as np
import h5py
import glob

path="/home/ling/test/"

def runtest():
    data1 = [np.loadtxt(file) for file in glob.glob(path + "data1/*.csv")]
    data2 = [np.loadtxt(file) for file in glob.glob(path + "data2/*.csv")]

    stack = np.vstack((data1, data2))

    h5f = h5py.File("/home/ling/test/2test.h5", "w") 
    h5f.create_dataset("test_data", data=stack)
    h5f.close()

This works perfectly when all the arrays have the same size. But when the sizes differ, it throws the error TypeError: Object dtype dtype('O') has no native HDF5 equivalent

From what I understand of the answer given there, I must save the arrays as separate datasets. But looking at the example snippet given, `for k, v in adict.items()` and `grp.create_dataset(k, data=v)`: `k` should be the name of the dataset, correct? Like `test_data` from my example? And what is `v`?
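
To illustrate the `k`/`v` question: in that snippet, `k` is the name the dataset will have inside the HDF5 file and `v` is the NumPy array stored under that name. A minimal sketch (the file names `file1`/`file2` and the array contents here are made up for illustration):

```python
import numpy as np
import h5py

# Hypothetical dict standing in for the loaded CSV data:
# each key becomes a dataset name, each value is the array to store
adict = {"file1": np.arange(5.0), "file2": np.arange(8.0)}

with h5py.File("example.h5", "w") as h5f:
    grp = h5f.create_group("data1")
    for k, v in adict.items():      # k: dataset name, v: the array itself
        grp.create_dataset(k, data=v)

with h5py.File("example.h5", "r") as h5f:
    print(h5f["data1/file1"][:])    # -> [0. 1. 2. 3. 4.]
```

Because each array is its own dataset, they are free to have different lengths.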

Below is what the result of np.vstack (i.e. `stack`) looks like:

[[array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.11719, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([ 0.03906,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.11719,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.15625, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.11719, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.15625,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.11719, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.11719, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812,  0.     ])
  array([ 0.07812,  0.03906,  0.07812, ...,  0.03906,  0.07812,  0.     ])
  array([ 0.03906,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.11719,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])
  array([ 0.07812,  0.07812,  0.07812, ...,  0.07812,  0.07812,  0.     ])]
 [ array([ 10.9375 ,  10.97656,  10.97656, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.01562,  11.01562,  11.01562, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.09375,  11.09375,  11.09375, ...,  11.09375,  11.09375,   1.     ])
  array([ 10.97656,  11.01562,  11.01562, ...,  11.13281,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.01562, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.05469,  11.13281,   1.     ])
  array([ 11.05469,  11.09375,  11.09375, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.09375,  11.05469,  11.09375, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.09375, ...,  11.05469,  11.05469,   1.     ])
  array([ 10.97656,  10.97656,  10.97656, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.09375,  11.05469,  11.09375, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.09375,  11.17188,   1.     ])
  array([ 11.09375,  11.09375,  11.09375, ...,  10.97656,  11.09375,   1.     ])
  array([ 11.09375,  11.09375,  11.09375, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.01562,  11.05469, ...,  11.01562,  11.01562,   1.     ])
  array([ 10.78125,  10.78125,  10.78125, ...,  11.05469,  11.05469,   1.     ])
  array([ 11.13281,  11.09375,  11.13281, ...,  11.09375,  11.09375,   1.     ])
  array([ 11.13281,  11.09375,  11.09375, ...,  11.05469,  11.05469,   1.     ])
  array([ 10.97656,  10.97656,  10.9375 , ...,  11.05469,  11.05469,   1.     ])
  array([ 11.05469,  11.09375,  11.05469, ...,  11.09375,  11.09375,   1.     ])
  array([ 10.9375 ,  10.9375 ,  10.9375 , ...,  11.09375,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.05469,  11.05469,   1.     ])
  array([ 10.9375 ,  10.89844,  10.9375 , ...,  11.05469,  11.09375,   1.     ])
  array([ 10.9375 ,  10.97656,  10.97656, ...,  11.05469,  11.05469,   1.     ])
  array([ 10.89844,  10.89844,  10.89844, ...,  11.05469,  11.09375,   1.     ])
  array([ 11.05469,  11.05469,  11.05469, ...,  11.01562,  11.01562,   1.     ])]]

Thank you for your help and explanation.

Update

I solved the problem by using pandas. At first I used the exact suggestion by Pierre de Buyl, but it gave me an error when I tried to load/read the file/dataset. I tried `test_data = h5f["data1/file1"][:]`, which failed with an error saying Unable to open object (Object 'file1' does not exist).

I checked by reading 2test.h5 using pandas.read_hdf, and it showed that the file was empty. I searched online for another solution and found this. I modified it as follows:

import numpy as np
import glob

import pandas as pd

path = "/home/ling/test/"

def runtest():
    data1 = [np.loadtxt(file) for file in glob.glob(path + "data1/*.csv")]
    data2 = [np.loadtxt(file) for file in glob.glob(path + "data2/*.csv")]

    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)

    combine = df1.append(df2, ignore_index=True)

    # sort the NaN values to the left of each row
    combinedf = combine.apply(lambda x: sorted(x, key=pd.notnull), 1)
    combinedf.to_hdf('/home/ling/test/2test.h5', 'twodata')


runtest()

For reading, I simply use

input_data = pd.read_hdf('2test.h5', 'twodata')
read_input = input_data.values

read1 = read_input[:, -1] # read/get last column for example
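
Since the DataFrame pads the shorter rows with NaN, each original array can be recovered on read by dropping the NaNs row by row. A sketch with made-up data (two rows of different lengths):

```python
import numpy as np
import pandas as pd

# Hypothetical padded frame, as produced when rows have unequal lengths
df = pd.DataFrame([[1.0, 2.0, 3.0],
                   [4.0, 5.0, np.nan]])

# Drop the NaN padding to get back one variable-length array per row
rows = [row.dropna().values for _, row in df.iterrows()]
print(rows[1])   # -> [4. 5.]
```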
Ling
  • What does `stack` look like? – John Zwinck Sep 25 '17 at 08:19
  • it gives me the same error. – Ling Sep 25 '17 at 09:32
  • What does stack look like? If you're confused what I'm asking about, just do `print(stack)` and copy/paste the output into your question. – John Zwinck Sep 25 '17 at 12:37
  • sorry for misunderstanding your question. I edited my question with the result for `stack` together with `vstack` – Ling Sep 26 '17 at 03:32
  • What if you do `stack = np.vstack((np.vstack(data1), np.vstack(data2)))`? – John Zwinck Sep 26 '17 at 04:04
  • it throws me an error `ValueError: all the input array dimensions except for the concatenation axis must match exactly` – Ling Sep 26 '17 at 04:07
  • OK, so you need to make your array widths match. Every row needs the same number of columns. – John Zwinck Sep 26 '17 at 04:23
  • by making my array widths match, did you mean to cut or add new data to make it same? I think this is what my question from the beginning how to save as h5 file with arrays that have different size. cutting or adding data will interfere with my original data so I do not want to do that. By looking at the link I pasted from similar question, the answer did not included `stack` or `vstack` but I do not understand the answer so I was hoping if you have better understanding than me regarding the answer – Ling Sep 26 '17 at 04:42
  • Either you need to add the missing columns to the shorter rows, filling them with 0 or NaN, or you need to store the data in two separate HDF5 datasets. You cannot store "jagged" (non-rectangular) arrays in the way you are trying to do. – John Zwinck Sep 26 '17 at 05:55
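
The padding approach described in the comment above can be sketched as follows (the array lengths here are made up; NaN is used as the fill value):

```python
import numpy as np

arrays = [np.arange(3.0), np.arange(5.0)]       # jagged input rows
width = max(len(a) for a in arrays)

# Pad each row on the right with NaN so every row has the same width,
# which makes the result a regular 2-D array that HDF5 can store
padded = np.vstack([np.pad(a, (0, width - len(a)), constant_values=np.nan)
                    for a in arrays])
print(padded.shape)   # (2, 5)
```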

1 Answer


The basic elements in an HDF5 file are groups (similar to directories) and datasets (similar to arrays).

NumPy will create an array from many different kinds of input. When you attempt to create an array from elements of different lengths, NumPy returns an array of dtype 'O' (look for object_ in the NumPy reference guide). At that point there is little advantage to using NumPy, as such an array behaves like a standard Python list.
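
A quick way to see this (the array contents are made up; note that recent NumPy versions require an explicit dtype=object for ragged input):

```python
import numpy as np

# Unequal lengths: NumPy can only make an array of Python objects
a = np.array([np.arange(3.0), np.arange(5.0)], dtype=object)
print(a.dtype)            # object  -- the 'O' dtype HDF5 rejects

# Equal lengths: a regular 2-D numeric array, which HDF5 accepts
b = np.array([np.arange(3.0), np.arange(3.0)])
print(b.dtype, b.shape)   # float64 (2, 3)
```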

HDF5 cannot store arrays of dtype 'O' because it has no generic object datatype (only limited support for C-struct-like compound types).

The most obvious solution to your problem is to store your data with one HDF5 dataset per input file. You retain the advantage of collecting the data in a single file, and you get dict-like access to the elements.

Try the following code:

import numpy as np
import h5py
import glob

path="/home/ling/test/"

def runtest():
    h5f = h5py.File("/home/ling/test/2test.h5", "w") 
    h5f.create_group('data1')
    h5f.create_group('data2')

    # glob returns full paths such as "/home/ling/test/data1/file1.csv";
    # strip the leading path and the trailing ".csv" so the dataset is
    # created as e.g. "data1/file1" inside the HDF5 file
    for file in glob.glob(path + "data1/*.csv"):
        h5f.create_dataset(file[len(path):-4], data=np.loadtxt(file))
    for file in glob.glob(path + "data2/*.csv"):
        h5f.create_dataset(file[len(path):-4], data=np.loadtxt(file))

    h5f.close()

For reading:

h5f = h5py.File("/home/ling/test/2test.h5", "r")
test_data = h5f['data1/thefirstfilenamewithoutcsvextension'][:]
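
If you do not remember the exact dataset names, you can list the contents of a group with `.keys()`. A self-contained sketch (the file name and dataset names here are made up; note that create_dataset with a path like "data1/a" creates the intermediate group automatically):

```python
import numpy as np
import h5py

# Write a small example file, then inspect what is inside it
with h5py.File("list_example.h5", "w") as h5f:
    h5f.create_dataset("data1/a", data=np.arange(3.0))
    h5f.create_dataset("data1/b", data=np.arange(4.0))

with h5py.File("list_example.h5", "r") as h5f:
    names = sorted(h5f["data1"].keys())   # dataset names inside the group
    print(names)                          # ['a', 'b']
```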
Pierre de Buyl
  • Hi, your answer worked for me. just want to ask, can you elaborate on `file[:-4]` ? – Ling Sep 29 '17 at 09:22
  • `file` is a string that ends with `.csv`. As it is not typical to include the file extension in the name of a dataset, I removed it. For strings, `file[:-4]` will return the string with the last four elements omitted, that is without the trailing `.csv`. – Pierre de Buyl Sep 29 '17 at 09:31
  • hi, sorry for asking again. apparently, for reading it gave me error saying that `Unable to open object (Object "file1" does not exist)` – Ling Oct 02 '17 at 02:30
  • Can you give explicitly the line giving this error? Have you use the path to the file as `data1/mlqskfdjmlsqjfd` (without the `.csv`) to access the data ? – Pierre de Buyl Oct 02 '17 at 08:17
  • sorry I can't provide you with the detail error. However, I solved the problem. I already updated my question above with my solution. I believe there must be other good solution but for right now I am going to stick with what I found. Thank you for your help. – Ling Oct 03 '17 at 09:27
  • As far as I understand, you changed the solution to use Pandas' advanced storage facilities, which mask the technicalities. The original problem is still solved by the answer I provided. I find it inappropriate to "un-approve" the answer at this point, but whatever. This is mostly a note for other users interested in using HDF5 storage, however. – Pierre de Buyl Oct 03 '17 at 09:36