First of all, I read the topic "Fastest way to write hdf5 file with Python?", but it was not very helpful.
I am trying to load a file of about 1 GB (a matrix of size (70133351, 1)) into an HDF5 structure.
The code is pretty simple, but slow.
import h5py
f = h5py.File("8.hdf5", "w")
dset = f.create_dataset("8", (70133351, 1))
myfile = open("8.txt")
for line in myfile:
    # each line is "<number>\t<string>"; store the string as an attribute
    # key so it can be looked up via h.attrs later
    line = line.strip().split("\t")
    dset.attrs[line[1]] = line[0]
myfile.close()
f.close()
I tried the same code on a smaller, 50 MB version of the matrix, and it still had not finished after 24 hours.
I know the way to make it faster is to avoid the for loop. If I were using regular Python, I would use a dict comprehension, but that does not seem to fit here; a rough sketch of what I mean is below.
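For reference, this is roughly the plain-Python version I have in mind (assuming the file really is tab-separated, with the number first and the eight-character string second, as in the example further down):

with open("8.txt") as myfile:
    # build the whole string -> number lookup in one pass
    mapping = {key: value
               for value, key in (line.strip().split("\t") for line in myfile)}

print 'GFXVG' in mapping  # same membership test I later do against h.attrs

That builds the entire lookup in memory in a single pass, which is what I would like to do on the HDF5 side instead of writing one entry per loop iteration.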
I can query the file later by:
f = h5py.File("8.hdf5")
h=f['8']
print 'GFXVG' in h.attrs
This prints "True", since GFXVG is one of the keys in h.attrs.
Does anyone have an idea how to speed this up?
Here is an example of part of the file:
508 LREGASKW
592 SVFKINKS
1151 LGHWTVSP
131 EAGQIISE
198 ELDDSARE
344 SQAVAVAN
336 ELDDSARF
592 SVFKINKL
638 SVFKINKI
107 PRTGAGQH
107 PRTGAAAA
Thanks