
How do I add two h5 files (say `1.h5` and `2.h5`) that have the same structure, and store the result in a new h5 file? By "add" I mean mathematical, element-wise addition of the datasets, not merging. I tried the following:

import h5py

f = h5py.File('1.h5', 'r')
f1 = h5py.File('2.h5', 'r')
f + f1

but I get the following error:

TypeError: unsupported operand type(s) for +: 'File' and 'File'
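The error is expected: h5py `File` objects do not define arithmetic, so the addition has to happen on the dataset contents, which read out as NumPy arrays via `[...]`. A minimal sketch for a single dataset (the file names `a.h5`, `b.h5`, `sum.h5` and the dataset `x` are made up for illustration):

```python
import h5py
import numpy as np

# build two tiny example files so the sketch is self-contained
for name, value in [('a.h5', 1.0), ('b.h5', 2.0)]:
    with h5py.File(name, 'w') as f:
        f['x'] = np.full(3, value)

# File + File raises TypeError; dataset[...] reads a NumPy array,
# and NumPy arrays do support element-wise '+'
with h5py.File('a.h5', 'r') as fa, h5py.File('b.h5', 'r') as fb, \
        h5py.File('sum.h5', 'w') as out:
    out['x'] = fa['x'][...] + fb['x'][...]

with h5py.File('sum.h5', 'r') as f:
    print(f['x'][...])  # [3. 3. 3.]
```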

Following is some information about the datasets, obtained from `f.visititems(lambda name, obj: print(name, obj))`:

conv2d_37 <HDF5 group "/conv2d_37" (1 members)>
conv2d_37/conv2d_37 <HDF5 group "/conv2d_37/conv2d_37" (2 members)>
conv2d_37/conv2d_37/bias:0 <HDF5 dataset "bias:0": shape (32,), type "<f4">
conv2d_37/conv2d_37/kernel:0 <HDF5 dataset "kernel:0": shape (2, 2, 1, 32), type "<f4">
conv2d_38 <HDF5 group "/conv2d_38" (1 members)>
conv2d_38/conv2d_38 <HDF5 group "/conv2d_38/conv2d_38" (2 members)>
conv2d_38/conv2d_38/bias:0 <HDF5 dataset "bias:0": shape (32,), type "<f4">
conv2d_38/conv2d_38/kernel:0 <HDF5 dataset "kernel:0": shape (2, 2, 32, 32), type "<f4">
conv2d_39 <HDF5 group "/conv2d_39" (1 members)>
conv2d_39/conv2d_39 <HDF5 group "/conv2d_39/conv2d_39" (2 members)>
conv2d_39/conv2d_39/bias:0 <HDF5 dataset "bias:0": shape (64,), type "<f4">
conv2d_39/conv2d_39/kernel:0 <HDF5 dataset "kernel:0": shape (2, 2, 32, 64), type "<f4">
conv2d_40 <HDF5 group "/conv2d_40" (1 members)>
conv2d_40/conv2d_40 <HDF5 group "/conv2d_40/conv2d_40" (2 members)>
conv2d_40/conv2d_40/bias:0 <HDF5 dataset "bias:0": shape (64,), type "<f4">
conv2d_40/conv2d_40/kernel:0 <HDF5 dataset "kernel:0": shape (2, 2, 64, 64), type "<f4">
dense_19 <HDF5 group "/dense_19" (1 members)>
dense_19/dense_19 <HDF5 group "/dense_19/dense_19" (2 members)>
dense_19/dense_19/bias:0 <HDF5 dataset "bias:0": shape (256,), type "<f4">
dense_19/dense_19/kernel:0 <HDF5 dataset "kernel:0": shape (7744, 256), type "<f4">
dense_20 <HDF5 group "/dense_20" (1 members)>
dense_20/dense_20 <HDF5 group "/dense_20/dense_20" (2 members)>
dense_20/dense_20/bias:0 <HDF5 dataset "bias:0": shape (2,), type "<f4">
dense_20/dense_20/kernel:0 <HDF5 dataset "kernel:0": shape (256, 2), type "<f4">
dropout_28 <HDF5 group "/dropout_28" (0 members)>
dropout_29 <HDF5 group "/dropout_29" (0 members)>
dropout_30 <HDF5 group "/dropout_30" (0 members)>
flatten_10 <HDF5 group "/flatten_10" (0 members)>
max_pooling2d_19 <HDF5 group "/max_pooling2d_19" (0 members)>
max_pooling2d_20 <HDF5 group "/max_pooling2d_20" (0 members)>

**Edit**

Code copied from comments (where it was unreadable):

import h5py
import numpy as np

data = h5py.File('1.h5', 'r')
new_data = h5py.File('new.hdf5', 'w')

# getdatasets as defined in the answer linked in the comments
datasets = getdatasets('/', data)

groups = list(set([i[::-1].split('/', 1)[1][::-1] for i in datasets]))
groups = [i for i in groups if len(i) > 0]
idx = np.argsort(np.array([len(i.split('/')) for i in groups]))
groups = [groups[i] for i in idx]

for group in groups:
    new_data.create_group(group)

for path in datasets:
    group = path[::-1].split('/', 1)[1][::-1]
    if len(group) == 0:
        group = '/'

data1 = h5py.File('2.h5', 'r')
datasets1 = getdatasets('/', data1)

groups1 = list(set([i[::-1].split('/', 1)[1][::-1] for i in datasets1]))
groups1 = [i for i in groups1 if len(i) > 0]
idx1 = np.argsort(np.array([len(i.split('/')) for i in groups1]))
groups1 = [groups1[i] for i in idx1]

for path in datasets1:
    group1 = path[::-1].split('/', 1)[1][::-1]
    if len(group1) == 0:
        group1 = '/'

#%%
for key in datasets:
    new_data[key] = data[key][...] + data1[key][...]
hpaulj
Hitesh

  • May [this answer](https://stackoverflow.com/a/49856991/2646505) is what you are looking for? – Tom de Geus Apr 19 '18 at 07:47
  • Also note that it is quite logical that there should be some 'manual labor'. The operation involves quite some ambiguity: do you want to append one file to the other? Do you want to add the constituents? How do you treat conflicts? Even from your question this remains completely unclear. – Tom de Geus Apr 19 '18 at 07:53
  • @Tom de Geus I am very sorry for my unclear question; next time I will take care of it. I am editing my question. The answer you linked is not what I am looking for, because it is about merging; I want to perform mathematical addition of the elements of one h5 file with those of another h5 file with the same structure. – Hitesh Apr 19 '18 at 08:08
  • I understand. Still, [this answer](https://stackoverflow.com/questions/49851046/merge-all-h5-files-using-h5py/49856991#49856991) should be sufficient to do this, right? In fact, it is much simpler: once you have all `datasets` you can just loop and add. Wouldn't `for key in datasets: new[key] = first[key][...] + second[key][...]` work? – Tom de Geus Apr 19 '18 at 08:47
  • Not working; the new file is empty. The code I tried is a bit long, so I have moved it into the question above (see the edit). – Hitesh Apr 19 '18 at 09:50
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/169336/discussion-between-tom-de-geus-and-hitesh). – Tom de Geus Apr 19 '18 at 12:35
  • `numpy` arrays implement element-wise addition. Python `list`s implement concatenation. h5py `File` objects do not implement any addition. You'll have to properly edit the added code if you want more help. – hpaulj Apr 19 '18 at 16:51
  • @hpaulj Yes, you are correct; the code in the comments was not readable as intended. Sorry for this, but I am unable to format code properly in comments. Thanks for editing the question. – Hitesh Apr 20 '18 at 05:21
  • Comments don't preserve the line breaks that readable code requires. They should only be used for text and short code pieces (a single line). – hpaulj Apr 20 '18 at 05:36
  • OK, got it, thanks. – Hitesh Apr 20 '18 at 05:40

1 Answer

I don't fully understand where you got stuck, but I do have a working implementation that does exactly what you want:

import h5py
import numpy as np

# write example files
# -------------------

for name in ['1.hdf5', '2.hdf5']:

  data = h5py.File(name,'w')
  data['A'] = np.arange(25).reshape(5,5)
  data.close()

# support function
# ----------------

def getdatasets(key,archive):

  if key[-1] != '/': key += '/'

  out = []

  for name in archive[key]:

    path = key + name

    if isinstance(archive[path], h5py.Dataset):
      out += [path]
    else:
      out += getdatasets(path, archive)

  return out

# perform copying
# ---------------

# open both source-files and the destination
data1    = h5py.File('1.hdf5'  ,'r')
data2    = h5py.File('2.hdf5'  ,'r')
new_data = h5py.File('new.hdf5','w')

# get datasets
datasets  = sorted(getdatasets('/', data1))
datasets2 = sorted(getdatasets('/', data2))

# check consistency of datasets
# - number
if len(datasets) != len(datasets2):
  raise IOError('files not consistent')
# - item-by-item
for a,b in zip(datasets, datasets2):
  if a != b:
    raise IOError('files not consistent')

# get the group-names from the lists of datasets
groups = list(set([i[::-1].split('/',1)[1][::-1] for i in datasets]))
groups = [i for i in groups if len(i)>0]

# sort groups based on depth
idx    = np.argsort(np.array([len(i.split('/')) for i in groups]))
groups = [groups[i] for i in idx]

# create all groups that contain a dataset
for group in groups:
  new_data.create_group(group)

# copy (add) datasets
for path in datasets:

  # - get group name
  group = path[::-1].split('/',1)[1][::-1]

  # - minimum group name
  if len(group) == 0: group = '/'

  # - copy data
  new_data[path] = data1[path][...] + data2[path][...]

# verify
# ------

# copy (add) datasets
for path in datasets:
  print(new_data[path][...])

# close all files
# ---------------

new_data.close()
data1.close()
data2.close()

which indeed gives twice the `arange` that was used as input:

[[ 0  2  4  6  8]
 [10 12 14 16 18]
 [20 22 24 26 28]
 [30 32 34 36 38]
 [40 42 44 46 48]]
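As an aside, the recursive `getdatasets` helper above can also be written with h5py's built-in `visititems`, which walks the whole tree for you. A sketch with the same behaviour (demonstrated on a throwaway file, since `visititems` needs something to walk):

```python
import h5py
import numpy as np

def getdatasets(key, archive):
    """Collect the paths of all datasets below `key`, via visititems."""
    out = []
    # visititems calls the function for every group/dataset below `key`,
    # with `name` given relative to `key`; keep only the datasets
    archive[key].visititems(
        lambda name, obj: out.append(key.rstrip('/') + '/' + name)
        if isinstance(obj, h5py.Dataset) else None)
    return out

# quick demonstration on a throwaway file
with h5py.File('demo.h5', 'w') as f:
    f['g/sub/x'] = np.arange(3)
    f['y'] = np.arange(2)

with h5py.File('demo.h5', 'r') as f:
    print(sorted(getdatasets('/', f)))  # ['/g/sub/x', '/y']
```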

I really think that the question was already answered here. The explanation is all there.

Tom de Geus
  • Yup, now it is correct and working properly, thanks. Can you please explain why `raise IOError('files not consistent')` appears twice? If we already check `if len(datasets) != len(datasets2):`, what is the need of checking again with `for a,b in zip(datasets, datasets2):`? – Hitesh Apr 20 '18 at 05:34
  • I am using Spyder to run my program. When I run the given code, everything is OK and I get the answer in the console, but when I check my `new.h5` file, it is empty. What can be the reason? I am using HDFView 2.9 for viewing the file. – Hitesh Apr 20 '18 at 05:39
  • @Hitesh The file can hardly be empty as the output is printed from the file. **Do note that in my example the file is called `new.hdf5`**, *not `new.h5`*. – Tom de Geus Apr 20 '18 at 07:54
  • @Hitesh The checks are necessary to deal with the ambiguity of the operation that you want to accomplish. These checks make sure that at least the structure is identical. Presumably the `+` operation will do the same for the contents of the datasets, but more extensive checking may be needed. *Note once more: the input could be arbitrary files; there is no a priori reason that it should always work, so you need to check.* – Tom de Geus Apr 20 '18 at 07:57
  • @Hitesh Sorry, I partly misread your question. The two checks are necessary because one checks that the number of datasets (paths) is the same, while the other loops over all paths to check that their names are identical. – Tom de Geus Apr 20 '18 at 08:07
  • One doubt: maybe I am unable to see the content of the `new.hdf5` file because it is a write-only file? (I am not sure; I am saying this because of `new_data = h5py.File('new.hdf5','w')`.) – Hitesh Apr 20 '18 at 09:08
  • @Hitesh Did you close the files? You might run in some interactive mode, so you might not explicitly close them; see the edit. In my case I run only the script, and closing is done automatically when the script finishes. – Tom de Geus Apr 20 '18 at 09:27
  • Oh god, I forgot to close them. Now I get the output. – Hitesh Apr 20 '18 at 09:51
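For reference, the forgotten `close()` from the last comments can be avoided altogether by opening all files with `with`, which closes them automatically even if an error occurs. A sketch of the copy step under the same file names as the answer (with a single hard-coded dataset path, `A`, for brevity):

```python
import h5py
import numpy as np

# recreate the answer's example inputs
for name in ['1.hdf5', '2.hdf5']:
    with h5py.File(name, 'w') as f:
        f['A'] = np.arange(25).reshape(5, 5)

# all three files are closed automatically when the 'with' block ends,
# so new.hdf5 is guaranteed to be flushed to disk
with h5py.File('1.hdf5', 'r') as data1, \
        h5py.File('2.hdf5', 'r') as data2, \
        h5py.File('new.hdf5', 'w') as new_data:
    new_data['A'] = data1['A'][...] + data2['A'][...]

with h5py.File('new.hdf5', 'r') as f:
    print(f['A'][0])  # [0 2 4 6 8]
```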