This question follows up on this one. I am opening a new thread because I did not really understand the answer given there, and I hope someone can explain it in more detail.
Basically, my problem is the same as in the linked question. Until now, I have used np.vstack to stack the data and created an h5 file from the result. Below is my example:
import numpy as np
import h5py
import glob

path="/home/ling/test/"

def runtest():
    data1 = [np.loadtxt(file) for file in glob.glob(path + "data1/*.csv")]
    data2 = [np.loadtxt(file) for file in glob.glob(path + "data2/*.csv")]
    stack = np.vstack((data1, data2))
    h5f = h5py.File("/home/ling/test/2test.h5", "w")
    h5f.create_dataset("test_data", data=stack)
    h5f.close()
This works perfectly if all the arrays have the same size. But when the sizes differ, it raises TypeError: Object dtype dtype('O') has no native HDF5 equivalent
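A minimal sketch of where the object dtype comes from, assuming two hypothetical rows of unequal length in place of the CSV files. Rows of different length cannot form a regular 2-D float array, so the only container NumPy can offer is a 1-D array of Python objects, and that is the dtype named in the error message. The names rows and ragged are made up for this illustration.

```python
import numpy as np

# Two hypothetical rows of different length, standing in for two CSV files.
rows = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]

# Build the ragged container explicitly. Recent NumPy versions refuse to
# create it implicitly from a list of unequal-length arrays.
ragged = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    ragged[i] = row

print(ragged.dtype)  # dtype of the container, not of the rows
print(ragged.shape)  # one entry per row, no second dimension
print(rows[0].dtype, rows[1].dtype)  # each row on its own is a plain float array
```

Each row taken alone still has a numeric dtype, which is why writing the rows one by one is possible where writing the stacked object array is not.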
From what I understand of the answer given there, I must save the arrays as separate datasets. But looking at the example snippet given, for k,v in adict.items() and grp.create_dataset(k,data=v), k should be the name of the dataset, correct? Like test_data in my example? And what is v?
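A small sketch of how that snippet can be read, under the assumption that adict is a plain dictionary mapping a dataset name to an array. In that reading k is the name the dataset gets inside the group and v is the array stored under that name. The dictionary contents, the group name data1 and the temporary file path are all hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in for the dictionary in the linked answer:
# each key (k) is a dataset name, each value (v) is the array saved under it.
adict = {
    "file1": np.array([1.0, 2.0, 3.0]),
    "file2": np.array([4.0, 5.0]),  # a different length is fine here
}

# Temporary location so the sketch does not touch any real file.
h5_path = os.path.join(tempfile.mkdtemp(), "ragged.h5")

with h5py.File(h5_path, "w") as h5f:
    grp = h5f.create_group("data1")
    for k, v in adict.items():
        grp.create_dataset(k, data=v)  # one dataset per array

# Read back: the datasets are addressed as "<group>/<name>".
with h5py.File(h5_path, "r") as h5f:
    names = sorted(h5f["data1"].keys())
    file2 = h5f["data1/file2"][:]

print(names)
print(file2)
```

With the CSV files from the question, one possible choice would be to use each file name as k and the array returned by np.loadtxt as v.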
Below is what the result of vstack (the stack variable) looks like:
[[array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.07812, -0.07812, -0.11719, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([ 0.03906, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.11719, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.15625, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.11719, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.15625, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.11719, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.11719, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.03906, 0.07812, ..., 0.03906, 0.07812, 0. ])
array([ 0.03906, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.11719, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])]
[ array([ 10.9375 , 10.97656, 10.97656, ..., 11.05469, 11.05469, 1. ])
array([ 11.01562, 11.01562, 11.01562, ..., 11.09375, 11.09375, 1. ])
array([ 11.09375, 11.09375, 11.09375, ..., 11.09375, 11.09375, 1. ])
array([ 10.97656, 11.01562, 11.01562, ..., 11.13281, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.01562, ..., 11.09375, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.13281, 1. ])
array([ 11.05469, 11.09375, 11.09375, ..., 11.09375, 11.09375, 1. ])
array([ 11.09375, 11.05469, 11.09375, ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.09375, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.09375, ..., 11.05469, 11.05469, 1. ])
array([ 10.97656, 10.97656, 10.97656, ..., 11.05469, 11.05469, 1. ])
array([ 11.09375, 11.05469, 11.09375, ..., 11.09375, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.09375, 11.17188, 1. ])
array([ 11.09375, 11.09375, 11.09375, ..., 10.97656, 11.09375, 1. ])
array([ 11.09375, 11.09375, 11.09375, ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.01562, 11.05469, ..., 11.01562, 11.01562, 1. ])
array([ 10.78125, 10.78125, 10.78125, ..., 11.05469, 11.05469, 1. ])
array([ 11.13281, 11.09375, 11.13281, ..., 11.09375, 11.09375, 1. ])
array([ 11.13281, 11.09375, 11.09375, ..., 11.05469, 11.05469, 1. ])
array([ 10.97656, 10.97656, 10.9375 , ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.09375, 11.05469, ..., 11.09375, 11.09375, 1. ])
array([ 10.9375 , 10.9375 , 10.9375 , ..., 11.09375, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])
array([ 10.9375 , 10.89844, 10.9375 , ..., 11.05469, 11.09375, 1. ])
array([ 10.9375 , 10.97656, 10.97656, ..., 11.05469, 11.05469, 1. ])
array([ 10.89844, 10.89844, 10.89844, ..., 11.05469, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.01562, 11.01562, 1. ])]]
Thank you for your help and explanation.
Update
I solved the problem by using pandas. At first I used the exact suggestion by Pierre de Buyl, but it gave me an error when I tried to read the dataset with test_data = h5f["data1/file1"][:]. The error said Unable to open object (Object 'file1' does not exist).
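One way to investigate an error like this is to list what the file actually contains before indexing into it. The sketch below is only an illustration: it writes a small hypothetical file first (the dataset names data1/a and data2/b are made up) and then walks it with h5py's visititems, recording every dataset path and shape.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file with two datasets of different length.
h5_path = os.path.join(tempfile.mkdtemp(), "inspect.h5")
with h5py.File(h5_path, "w") as h5f:
    h5f.create_dataset("data1/a", data=np.arange(3.0))
    h5f.create_dataset("data2/b", data=np.arange(2.0))

found = []

def record(name, obj):
    # visititems calls this for every group and dataset in the file;
    # keep only the datasets, with their full path and shape.
    if isinstance(obj, h5py.Dataset):
        found.append((name, obj.shape))

with h5py.File(h5_path, "r") as h5f:
    h5f.visititems(record)

for name, shape in sorted(found):
    print(name, shape)
```

If the path used for reading does not appear in this listing, the dataset was written under a different name or was never written.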
I checked by reading 2test.h5 with pandas.read_hdf, and it showed that the file was empty. I searched online for another solution and found this. Here is my modified version:
import numpy as np
import glob
import pandas as pd

path = "/home/ling/test/"

def runtest():
    data1 = [np.loadtxt(file) for file in glob.glob(path + "data1/*.csv")]
    data2 = [np.loadtxt(file) for file in glob.glob(path + "data2/*.csv")]
    # shorter rows are padded with NaN on the right
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    # stack the two frames row-wise (DataFrame.append no longer exists in pandas 2.x)
    combine = pd.concat([df1, df2], ignore_index=True)
    # sort the NaN to the left; result_type="expand" keeps the result a DataFrame
    combinedf = combine.apply(lambda x : sorted(x, key=pd.notnull), axis=1, result_type="expand")
    combinedf.to_hdf('/home/ling/test/2test.h5', 'twodata')

runtest()
For reading, I simply use
input_data = pd.read_hdf('2test.h5', 'twodata')
read_input = input_data.values
read1 = read_input[:, -1] # read/get last column for example
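The padding idea can also be sketched without pandas. This is not the code I ran, just a NumPy-only illustration of the same layout, assuming hypothetical rows whose last value is the label: every row is right-aligned in a NaN-filled array, so the NaN end up on the left and the last column always holds the final value of each row.

```python
import numpy as np

# Hypothetical rows of different length; the last value of each is the label.
rows = [np.array([0.1, 0.2, 0.3, 1.0]), np.array([0.5, 1.0])]

width = max(len(r) for r in rows)

# Start from an all-NaN rectangle and right-align every row in it.
padded = np.full((len(rows), width), np.nan)
for i, r in enumerate(rows):
    padded[i, width - len(r):] = r

labels = padded[:, -1]  # last column, same as read_input[:, -1] above

print(padded)
print(labels)
```

The result is a regular float64 array, so it has a native HDF5 type and could be passed to create_dataset directly.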