I have many (25k) .csv files that I'm trying to append into an HDFStore file. They all share identical headers. I am using the code below, but whenever I run it the stored dataframe doesn't contain all of the files; it contains only the last file in the list.

filenames = []  # list of .csv file paths that I've already populated
dtypes = {dict of datatypes}
store = pd.HDFStore('store.h5')
store.put('df', pd.read_csv(filenames[0], dtype=dtypes, parse_dates=["date"]))  # store one data frame

for f in filenames:
    try:
        temp_csv = pd.DataFrame()
        temp_csv = pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"])
        store.append('df', temp_csv)
    except:
        pass

I've tried using a subset of the filenames list, but always get the last entry. For some reason, the loop is not appending my file, but rather overwriting it every single time. Any advice would be appreciated as this is driving me bonkers. (python 3, windows)

  • If you don't have to do it with `pandas`, you can do it with the normal Python `open` command. Take a look at this [link](http://stackoverflow.com/questions/2363731/append-new-row-to-old-csv-file-python) – cookiedough Jun 09 '17 at 19:17
  • Thanks for the suggestion- I'll give this a try. I'm using this approach because the csv files each have around 100k rows and there's 25k of them. When I tried to do it with just a dataframe, not the hdf file, my computer kept crashing because the data set was just too large. – Aaron Pujanandez Jun 09 '17 at 19:40
  • Catch-all **except** clauses are rarely a good idea. What does your **except: pass** hide? – JL Peyret Jun 09 '17 at 22:58
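
To make the last comment concrete, here is a minimal sketch of a loop that records failures instead of swallowing them. The names `process_files` and `handle_one` are hypothetical; `handle_one` stands in for whatever is done per file (for example the `read_csv` plus `store.append` step in the question).

```python
import logging

logger = logging.getLogger(__name__)


def process_files(paths, handle_one):
    """Call handle_one(path) for every path and report failures instead of hiding them.

    Both names are hypothetical helpers, not part of pandas.
    Returns a list of (path, exception) pairs for the files that failed.
    """
    failures = []
    for path in paths:
        try:
            handle_one(path)
        except Exception as exc:  # still broad, but no longer silent
            logger.warning("skipping %s: %r", path, exc)
            failures.append((path, exc))
    return failures
```

If the returned list holds an entry for nearly every file, the loop body never succeeded, and the recorded exception shows why.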

2 Answers

I think the problem is related to:

store.append('df', temp_csv)

If I understand correctly what you're trying to do, 'df' should change on every iteration; right now you're just overwriting it.

SeaMonkey
  • When I tried doing that, my store contained each df as a separate file. According to the documentation, https://pandas.pydata.org/pandas-docs/stable/generated/pandas.HDFStore.append.html, the first parameter is the key and the second should be the value, unless I'm reading this wrong. I was also working off of the "Table Format" example found here: http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5. – Aaron Pujanandez Jun 09 '17 at 19:34
  • The thing is that I'm not sure that you can have two values for the same key. I think that the first one is overwritten. – SeaMonkey Jun 09 '17 at 19:50
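
On the question of whether a second `append` to the same key overwrites the first: the sketch below suggests that it does not, provided the node is stored in table format. `put()` writes the fixed format by default, and a fixed-format node cannot be appended to, so the sketch starts from an empty file and only ever calls `append`. The function name `append_csvs` and its parameters are hypothetical, and PyTables is assumed to be installed.

```python
import pandas as pd


def append_csvs(filenames, store_path, key="df", **read_csv_kwargs):
    """Append every CSV in `filenames` to one table under `key`; return the row count.

    The function name and its parameters are hypothetical. Requires PyTables.
    """
    # mode="w" starts from an empty file, so no fixed-format node written
    # earlier with put() is left under `key` to block the appends.
    with pd.HDFStore(store_path, mode="w") as store:
        for f in filenames:
            chunk = pd.read_csv(f, **read_csv_kwargs)
            # Same key every time: rows are added to the existing table.
            # String columns may need min_itemsize if later files hold longer values.
            store.append(key, chunk, format="table")
        return store.get_storer(key).nrows
```

Only one file's rows are held in memory at a time, which matters with 25k files of roughly 100k rows each. A call might look like `append_csvs(filenames, 'store.h5', dtype=dtypes, parse_dates=["trade_date"])`.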

You're creating/storing a new DataFrame with each iteration, like @SeaMonkey said. Your consolidated dataframe should be built outside your loop, something like this:

import pandas as pd

filenames = []  # list of .csv file paths that I've already populated
dtypes = {dict of datatypes}

# Read every file, then concatenate once.
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it.)
frames = [pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"]) for f in filenames]
df = pd.concat(frames)

store = pd.HDFStore('store.h5')
store.put('df', df)
store.close()
hobs