1

First of all, I read the topic "Fastest way to write hdf5 file with Python?", but it was not very helpful.

I am trying to load a file of about 1 GB (a matrix of size (70133351, 1)) into an hdf5 structure.

Pretty simple code, but slow.

import h5py
f = h5py.File("8.hdf5", "w")
dset = f.create_dataset("8", (70133351,1))

myfile=open("8.txt")

for line in myfile:
   line=line.split("\t")
   dset[line[1]]=line[0]

myfile.close()
f.close()

I have a smaller, 50 MB version of the matrix; I tried the same code on it, and it had not finished after 24 hours.

I know the way to make it faster is to avoid the "for loop". If I were using regular Python, I would use a dict comprehension, but that does not seem to fit here.
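For reference, the in-memory approach alluded to would look roughly like this (a sketch only; it assumes the tab-separated layout of the example file further down, with the second column as the key):

# Sketch of the plain-Python dict approach: load everything into memory.
# Assumes tab-separated lines of the form: number <TAB> 8-character key.
with open("8.txt") as fh:
    lookup = {pep: num for num, pep in
              (line.rstrip("\n").split("\t") for line in fh)}

print 'GFXVG' in lookup   # fast, but the dict has to be rebuilt on every run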

I can query the file later by:

f = h5py.File("8.hdf5")
h=f['8']
print 'GFXVG' in h.attrs 

which would answer "True", considering that GFXVG is one of the keys in h.

Does someone have any idea?

Example of part of the file:

508 LREGASKW
592 SVFKINKS
1151        LGHWTVSP
131 EAGQIISE
198 ELDDSARE
344 SQAVAVAN
336 ELDDSARF
592 SVFKINKL
638 SVFKINKI
107 PRTGAGQH
107 PRTGAAAA

Thanks

user3780518
  • from the post you are [quoting](http://stackoverflow.com/questions/5466971/fastest-way-to-write-hdf5-file-with-python), "read [..] in in chunks as large as you can hold" and write in chunks too (see the sketch after these comments). – toine Jun 26 '14 at 18:29
  • Hi Toine. Thanks for pointing it out. Could you show me an example? – user3780518 Jun 26 '14 at 18:33
  • can you put a couple of lines of data from the file 8.txt. – toine Jun 26 '14 at 18:45
  • Sure, it is a tabular file. The idea is that the elements in the second column are the primary keys, and I need a fast way to access them without having to load the data into a hash every time. I added an example at the top – user3780518 Jun 26 '14 at 18:54
  • I think the issue is that you are not casting the index into an `int` – daniel Jun 26 '14 at 19:13
  • 1
    Oh, @user3780518, just saw your comment. Sorry for misreading. h5py datasets are arrays and are not hash tables. You'll potentially want to write your own hash function here which can map those `str` to `int` such that you can index into a dataset. – daniel Jun 26 '14 at 19:35
  • Using the final product of my slow code, I could use the structure as a "hash" by loading the file as: `f = h5py.File("8.hdf5") h=f['8'] print 'GFXVG' in h.attrs f.close()` – user3780518 Jun 26 '14 at 20:00
  • 1
    `h.attrs` will be a dict, but it isn't advised to use `attrs` as a dataset. The datasets in hdf5 can essentially be thought of as numpy arrays, and I think there is a fundamental issue with the current approach. That it _works_ for your test doesn't mean it is assured to work. – daniel Jun 26 '14 at 21:15
  • I see. So is there another approach (even without h5py) that avoids reading the big file and loading it into a dict every time? Something that stores the data once so it can be read back like a dict? It is a pain that every time I run my program I have to read the big file into a dict; I thought I would escape that with h5py. – user3780518 Jun 26 '14 at 21:44
  • you could pickle the dict to disk and load it when you need it. – toine Jun 26 '14 at 22:11
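A rough sketch of the chunked-write suggestion from the first comment (the block size, the tab separator, and the 18-character string width are assumptions, not anything from the original posts):

import h5py
import numpy as np

CHUNK = 1000000     # hypothetical block size; tune it to the memory you can spare
N_ROWS = 70133351   # total number of lines, taken from the question

with open("8.txt") as src, h5py.File("8.hdf5", "w") as f:
    # one row per line, two fixed-width string columns: number and key
    dset = f.create_dataset("8", (N_ROWS, 2), dtype="|S18")
    row = 0
    block = []
    for line in src:
        block.append(line.rstrip("\n").split("\t"))
        if len(block) == CHUNK:
            # write a whole block in one slice assignment instead of one cell at a time
            dset[row:row + len(block)] = np.array(block, dtype="|S18")
            row += len(block)
            block = []
    if block:   # flush the final partial block
        dset[row:row + len(block)] = np.array(block, dtype="|S18")

Writing whole slices at once avoids the per-element dataset assignments that make the loop in the question so slow; it does not by itself give the key-based lookup discussed in the later comments.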

3 Answers

1

You can load all the data into a numpy array with loadtxt and use it to instantiate your hdf5 dataset.

import h5py
import numpy as np

d = np.loadtxt('data.txt', dtype='|S18')

which returns

array([['508.fna', 'LREGASKW'],
   ['592.fna', 'SVFKINKS'],
   ['1151.fna', 'LGHWTVSP'],
   ['131.fna', 'EAGQIISE'],
   ['198.fna', 'ELDDSARE'],
   ['344.fna', 'SQAVAVAN'],
   ['336.fna', 'ELDDSARF'],
   ['592.fna', 'SVFKINKL'],
   ['638.fna', 'SVFKINKI'],
   ['107.fna', 'PRTGAGQH'],
   ['1197.fna', 'ELDDSARR'],
   ['1309.fna', 'SQTIYVWF'],
   ['974.fna', 'PNNLRFIA'],
   ['230.fna', 'IGKVYHIE'],
   ['76.fna', 'PGVHSVWV'],
   ['928.fna', 'HERGGAND'],
   ['520.fna', 'VLKTDTTG'],
   ['1290.fna', 'EAALDLHR'],
   ['25.fna', 'FCSILGVV'],
   ['284.fna', 'YHKLTFED'],
   ['1110.fna', 'KITSSSDF']], 
  dtype='|S18')

and then

h = h5py.File('data.hdf5', 'w')
dset = h.create_dataset('data', data=d)

that gives:

<HDF5 dataset "data": shape (21, 2), type "|S18">
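Reading it back later is just the reverse (a sketch, reusing the file and dataset names from the snippet above):

h = h5py.File('data.hdf5', 'r')   # reopen read-only
d2 = h['data'][...]               # pull the whole (21, 2) string array back into memory
print d2[2, 1]                    # -> 'LGHWTVSP'
h.close()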
toine
  • I like the use of `loadtxt`, I always forget about it. I think in the example data though col 0 is `int`, which could of course be managed with passing expected `dtype`s to `loadtxt`. – daniel Jun 26 '14 at 19:32
  • the data format changed in the process apparently .. : ) – toine Jun 26 '14 at 19:36
  • Thanks @toine, but with your method I cannot query my data as I could with the structure created by the slow code. `f = h5py.File("5.hdf5") h=f['5'] print 'GFXVG' in h.attrs ` – user3780518 Jun 26 '14 at 20:10
  • I have updated my question at the top; I believe it is clearer now – user3780518 Jun 26 '14 at 21:00
0

Since it's only a GB, why not load it completely into memory first? Note that it looks like you're also indexing into the dset with a str, which is likely the issue.

I just realized I misread the initial question, sorry about that. It looks like your code is attempting to use line[1], which appears to be a string, as an index. Perhaps there is a typo?

import h5py
from numpy import zeros

data = zeros((70133351,1), dtype='|S8') # assuming your strings are all 8 characters, use object if vlen

with open('8.txt') as myfile:
    for line in myfile:
        idx, item = line.strip().split("\t")
        data[int(idx)] = item

with h5py.File('8.hdf5', 'w') as f:
    dset = f.create_dataset("8", (70133351, 1), data=data)
daniel
  • Thanks for your answer, Daniel. However, I believe it does not fix my problem. As I said above, I need the second column to be my primary key, and the way you did it uses the numbers as primary keys. The numbers cannot be primary keys here because they repeat (maybe not in the tiny example I gave above; my bad). By the way, it has to be data[int(idx)] = item rather than data[int(line[1])] = line[0] – user3780518 Jun 26 '14 at 19:56
  • I have updated my question at the top; I believe it is clearer now – user3780518 Jun 26 '14 at 21:00
0

I ended up using the shelve library (Pickle versus shelve storing large dictionaries in Python) to store the large dictionary in a file. It took two days just to write the hash to the file, but once that was done I can load it and access any element very quickly. At the end of the day, I no longer have to read the big file and write all the information into the hash before doing whatever I was going to do with it.
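A minimal sketch of that approach (the file names and the tab-split are assumptions, not the code actually used):

import shelve

# One slow pass to build the shelf on disk
# (assumes tab-separated lines: number <TAB> key).
db = shelve.open("8.shelf")
src = open("8.txt")
for line in src:
    num, pep = line.rstrip("\n").split("\t")
    db[pep] = num   # the second column is the lookup key, as in the question
src.close()
db.close()

# Every later run just reopens the shelf; the big text file is never re-read.
db = shelve.open("8.shelf")
print 'GFXVG' in db
db.close()

The shelf behaves like a dict backed by a file, so later lookups are fast without reloading anything into memory.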

Problem solved!

user3780518