3

I'm attempting to load a large data set. I have ~8k day files, each with arrays of hundreds of measurements. I can load a single day file into a set of numpy arrays, which I store in a dictionary. To load all the day files, I initialize a dictionary with the desired keys. Then I loop through the list of files, loading each one and attempting to store its arrays in the larger dictionary.

    all_measurements = np.asarray([get_n_measurements(directory, name) for name in files])

    error_files = []

    temp = np.full(all_measurements.sum(), fill_value, dtype=np.float64)
    all_data = {key: temp.copy() for key in sample_file}

    start_index = 0
    for data_file, n_measurements in zip(file_list, all_measurements):

        file_data = one_file(data_file) # Load one data file into a dict.

        try:

            for key, value in file_data.items(): # I've tried .iteritems(), .viewitems() as well.

                all_data[key][start_index : start_index + n_measurements] = value

        except ValueError as msg:

            error_files.append((data_file, msg))

        finally:

            # Advance once per file (not once per key) so the next file gets its own slice.
            start_index += n_measurements

I've inspected the results of one_file() and I know that it properly loads the data. However, the combined all_data behaves as if every value is identical across key:value pairs.

Here is an example of the data structures:

all_data  = {'a': array([ 0.76290858,  0.83449302,  ...,  0.06186873]), 
             'b': array([ 0.32939997,  0.00111448,  ..., 0.72303435])}

file_data = {'a': array([ 0.00915347,  0.39020354]),
             'b': array([ 0.8992421 ,  0.18964702])}

In each iteration of the for loop, I attempt to insert the file_data into all_data at the indices [start_index : start_index + n_measurements].
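
A minimal sketch of that slice assignment, using made-up arrays in place of the real day files. The names `fake_files` and `counts` and the NaN fill value are assumptions for illustration only; the real data would come from one_file().

```python
import numpy as np

# Hypothetical stand-ins for two day files.
fake_files = [
    {"a": np.array([1.0, 2.0]), "b": np.array([10.0, 20.0])},
    {"a": np.array([3.0, 4.0, 5.0]), "b": np.array([30.0, 40.0, 50.0])},
]
counts = [len(f["a"]) for f in fake_files]

# One independent, preallocated array per key.
all_data = {key: np.full(sum(counts), np.nan) for key in fake_files[0]}

start_index = 0
for file_data, n_measurements in zip(fake_files, counts):
    for key, value in file_data.items():
        all_data[key][start_index:start_index + n_measurements] = value
    # Advance once per file, after every key has been written.
    start_index += n_measurements

print(all_data["a"])
print(all_data["b"])
```

Each key's array ends up holding the per-file timeseries joined in file order.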

blalterman
  • I'm not sure I understand the entire question, but would something like this work, namely combining two dictionaries? http://stackoverflow.com/questions/38987/how-can-i-merge-two-python-dictionaries-in-a-single-expression or: http://stackoverflow.com/questions/1781571/how-to-concatenate-two-dictionaries-to-create-a-new-one-in-python?lq=1 – db1234 Jul 31 '15 at 19:07
  • @db1234 It would not. The keys in both `all_data` and `file_data` are identical. The arrays stored in `all_data` are ~6M long. The arrays stored in `file_data` are <1k measurements long. Each `file_data` contains a timeseries that I'm trying to join, in order, within `all_data`. – blalterman Jul 31 '15 at 19:19
  • To make this clearer, please provide a sample data structure with the required dict, list, etc. Also, please explain what $file_data$ means: why the "$" characters? – Geeocode Jul 31 '15 at 19:26
  • @GeorgeSolymosi $file_data$ was a typo. It should have read file_data. file_data is simply a dictionary of numpy arrays that I have loaded from a single data file. I've added examples of the data to the original post. – blalterman Jul 31 '15 at 20:09
  • Maybe do a loop over the files and append the key. Something like: all_data['a'].append(file_data['a']) – db1234 Jul 31 '15 at 20:11
  • You say "I can't get the last two lines". Do you mean the last two array elements, i.e. the last two `file_data`? – Geeocode Jul 31 '15 at 20:38
  • @GeorgeSolymosi I don't understand your question. – blalterman Jul 31 '15 at 20:52
  • I mean, what did you mean by "two lines": the last two file_data, or something else? – Geeocode Jul 31 '15 at 21:07

2 Answers

0

If I have read your code correctly, and n_measurements is meant to give the total number of measurements, you probably meant to do something like this:

all_measurements = np.array(
    [len(n_measurements) for n_measurements in file_list]
)

Otherwise, how could all_measurements.sum() serve as the shape of the new np.array you are initializing?
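
To illustrate the relationship between the per-file counts and the combined array length, here is a small sketch with made-up counts. The values, and the `starts`/`stops` names, are assumptions and not taken from the question's data.

```python
import numpy as np

# Hypothetical per-file measurement counts.
all_measurements = np.asarray([2, 3, 1])

# Length of each combined array: every file's measurements laid end to end.
total = all_measurements.sum()

# Start offset of each file is the cumulative count of all files before it.
starts = np.concatenate(([0], np.cumsum(all_measurements)[:-1]))
stops = starts + all_measurements

print(total, starts, stops)
```

With the counts in hand, the sum gives the shape to preallocate and the running offsets give the slice each file should occupy.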

Geeocode
  • Sorry for the confusion. all_measurements already does something like that. I've edited the code to reflect what it actually does. – blalterman Jul 31 '15 at 21:16
  • @user1200989 What is the purpose of the try finally clause? – Geeocode Jul 31 '15 at 22:02
  • @GeorgeSoymosi some of the files loaded with one_file() have errors in them and I don't want to load them. The try/finally clause allows me to catch the errors (now shown) and load the next file having skipped the appropriate sections in the data. – blalterman Jul 31 '15 at 22:17
  • @user1200989 Please post a data snippet that shows this: "all_data behaves as if every value is identical across key:value pairs" – Geeocode Jul 31 '15 at 22:42
0

Turns out everything was going into the same container. The above code has been edited with the issue corrected.
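
The pre-edit code is not shown here, so the following is only a guess at the failure mode: if one preallocated array is reused as the value for every key, all keys share the same buffer. This sketch (with hypothetical `keys` and a NaN-filled `template`) contrasts that with giving each key its own copy.

```python
import numpy as np

keys = ["a", "b"]

# Same array object under every key: a write through one key shows up in all.
template = np.full(4, np.nan)
shared = {key: template for key in keys}
shared["a"][:2] = [1.0, 2.0]
print(np.shares_memory(shared["a"], shared["b"]))  # True

# A fresh copy per key: each key owns an independent buffer.
template = np.full(4, np.nan)
separate = {key: template.copy() for key in keys}
separate["a"][:2] = [1.0, 2.0]
print(np.shares_memory(separate["a"], separate["b"]))  # False
```

In the shared case, `shared["b"]` also starts with 1.0 and 2.0 even though only `"a"` was written, which would look like identical values across key:value pairs.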

blalterman