
I am trying to convert many .h5 files to a format that can be opened in Tableau. Since I am new to the Python debugger, I can't detect any runtime errors that may be happening. Also, I'm not sure whether it'd be better to split up the resulting CSV or just save it all to one file. I'm not positive how to do either of these things.

from pandas import HDFStore
import pdb
import os

indir = r'C:\Users\Aktosar\data'  # raw string, otherwise \U starts an escape sequence
for root, dirs, filenames in os.walk(indir):  # the colon was missing here
    for f in filenames:
        Pandas.convert(f)  # placeholder -- this conversion step is the part I don't know how to write

I also can't decide whether to use .to_csv or the other saving method. Any method that successfully converts all the data to a CSV that can be opened in Tableau is the right method for this exercise. :)

Any help with the completion of this would be much appreciated!

  • The approach I wrote about [here](https://stackoverflow.com/a/44594641/1577947) might help you with the concept. – Jarad Dec 08 '17 at 20:15
  • I'll give it a shot. Shouldn't be too hard to save that final concatenated dataframe to a .csv file. Thanks for the link! – Jacob Coordsen Dec 08 '17 at 20:49
  • For the life of me I can't resolve a 'cannot find the path specified' error. Does that mean anything to you? I swear I have the right path. I noticed the *.* on the end and wondered if that would affect anything. Thanks! – Jacob Coordsen Dec 09 '17 at 05:52

1 Answer


I think os.walk can be tricky; it's easy to lose track of where you are. On top of that, .h5 files with pandas can also be tricky, whether you read from pd.HDFStore or use pd.read_hdf. Figuring out the dataset names within an .h5 file is even more ridiculous. All this to say: lots of things can go wrong.

import pandas as pd
import numpy as np
import h5py
import os

dfs = []
for path, dirs, filenames in os.walk(os.curdir):
    if path != os.curdir:
        print(path, dirs, filenames)
        for file in filenames:
            file_path = os.path.join(path, file)
            h5_store = h5py.File(file_path, mode='r')
            dataset_names = list(h5_store.keys())
            for dataset in dataset_names:
                df = pd.DataFrame(h5_store[dataset][()])  # [()] is the modern spelling of the deprecated .value
                print(file, df.shape)
                dfs.append(df)
            h5_store.close()

final = pd.concat(dfs, ignore_index=True)  # ignore_index already gives a fresh RangeIndex
print(final.shape)
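To answer the one-file-versus-split question: one CSV is simplest, since Tableau reads a single CSV directly; splitting only helps if the file gets unwieldy. A minimal sketch of both options (the filenames `combined.csv` and `part_*.csv` are just illustrative, and a small stand-in frame is used here in place of `final` so the sketch runs on its own):

```python
import pandas as pd

# Stand-in for the concatenated `final` DataFrame from above.
final = pd.DataFrame({'a': range(10), 'b': range(10)})

# Option 1: one file. Tableau can open this directly.
final.to_csv('combined.csv', index=False)

# Option 2: split into chunks if the result is too large for one file.
chunk_size = 4  # rows per file; pick something much larger in practice
for i, start in enumerate(range(0, len(final), chunk_size)):
    final.iloc[start:start + chunk_size].to_csv(f'part_{i}.csv', index=False)
```

`index=False` keeps the pandas row index out of the CSV, which is usually what you want when handing the file to Tableau.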

My directory looks something like:

  dir1
      arr0.h5
     dir4
         arr0.h5
         arr1.h5
         arr2.h5
         arr3.h5
  dir2
      arr0.h5
      arr1.h5
  dir3
      arr0.h5
      arr1.h5
      arr2.h5
     dir5
         arr0.h5
     dir6
         arr0.h5
         arr1.h5
         arr2.h5
         arr3.h5
        dir7
            arr0.h5
            arr1.h5
            arr2.h5
            arr3.h5

That df = pd.DataFrame(h5_store[dataset].value) part is key though. That might not work. It depends on the type of data that is. You might try pd.read_hdf(...) if that doesn't work. It also depends on if you have multiple datasets in each .h5 file.
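Here is a self-contained sketch of that reading step with h5py (the filename `example.h5` and its dataset are made up here just so the example runs; in the real case the path comes from the os.walk loop above):

```python
import h5py
import numpy as np
import pandas as pd

# Build a small .h5 file so the sketch is self-contained.
with h5py.File('example.h5', mode='w') as f:
    f.create_dataset('arr0', data=np.arange(12).reshape(4, 3))

dfs = []
with h5py.File('example.h5', mode='r') as f:
    for name in f.keys():
        data = f[name][()]   # modern h5py spelling of the old `.value`
        if data.ndim <= 2:   # only 1-D/2-D arrays convert cleanly to a frame
            dfs.append(pd.DataFrame(data))

print(dfs[0].shape)  # (4, 3)

# If the file was originally written by pandas (HDFStore), this is
# usually the smoother route instead:
# df = pd.read_hdf(file_path, key=name)
```

Using the file as a context manager (`with h5py.File(...)`) also sidesteps the close-the-file problem discussed in the comments below.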

Jarad
  • I will try that this morning and let you know. Thanks for the response! Also, I am dreading that column names won't be brought in automatically. Is there any way to do that? – Jacob Coordsen Dec 09 '17 at 14:24
  • Getting an 'unable to open file, file signature not found' error. Does it have something to do with closing the file? Another article said that this could mean corrupted data... – Jacob Coordsen Dec 09 '17 at 14:45
  • Maybe you have to close the file in the code at some point? [reference](https://stackoverflow.com/a/26730843/1577947). – Jarad Dec 09 '17 at 17:45
  • I have the file close in there but I'm still getting the same failure to open error. I should call it at the end of the outer loop, correct? – Jacob Coordsen Dec 09 '17 at 19:22
  • Can you share the full error? You should call it at the same indentation level as `h5_store`, after appending to `dfs`. Can you check `dfs` to see if anything is even getting in there? – Jarad Dec 09 '17 at 21:59
  • https://i.gyazo.com/5224cc5f4f78f6332024ba9a27581287.png Can you read that tiny prompt text? I'll try to get it to show the dfs. Pretty terrible with this command line python debugger but I'll let you know! – Jacob Coordsen Dec 09 '17 at 22:30
  • I updated my code above to see how you should try to close the file. `file_path` is a string, `h5_store` is what you should try to close. You also need to take care about indenting. – Jarad Dec 09 '17 at 22:57
  • https://i.gyazo.com/5b949c8d1fe0b812cc2fbcaa930ad423.png Here's a paste of what you sent, and I got this same error. I read some places it could be an OS thing? Yeah, indenting and Python pick on me... – Jacob Coordsen Dec 10 '17 at 00:03
  • In my example above, I am in a root directory. That directory has three directories: dir1, dir2, dir3. In those directories, are more directories and .h5 files. The code above is meant for demonstration-purposes, not a copy/paste solution. It looks like you're running your script in a place where "content", "default", "images" ... exists. If you have "many" .h5 files, you need to plant this code one directory above it and run. Also print variables out and see what they actually are. That's the only way you'll learn this stuff. – Jarad Dec 10 '17 at 00:18
  • Right, I apologize if it seems that way. I was just trying to see if I could solve the error at hand and work it into the solution. I thought my file was in the right place, but on further inspection I'll work on the positioning. I'll also work on printing out variables. Thanks! – Jacob Coordsen Dec 10 '17 at 00:33
  • Solved those problems I was having, now I'm running into an AttributeError trying to actually store the dataframe. Is there any way to see the value of df before it crashes? Is this a fancy way of it saying that value was null?? Thanks! https://gyazo.com/dacdca8f1068d429c42e37c1f078edb9 – Jacob Coordsen Dec 10 '17 at 16:10