
I'm working with histograms stored as pandas Series, each representing the realizations of a random variable from an observation set. I'm looking for an efficient way to store them and read them back.

The histogram's bins are the index of the Series. For example:

histogram1:
(-1.3747106810983318, 3.529160051186781]    0.012520
(3.529160051186781, 8.433030783471894]      0.013830
(8.433030783471894, 13.336901515757006]     0.016495
(13.336901515757006, 18.24077224804212]     0.007194
(18.24077224804212, 23.144642980327234]     0.041667
(23.144642980327234, 28.048513712612344]    0.000000

I would like to store several of these histograms in a single csv file (one file for each set of random variables, one file would store ~100 histograms), and read them back later exactly as they were before storing (each histogram from the file as a single Series, all values as floats).

How can I do this? Since speed matters, is there a more efficient way than csv files?

Then, when a new realization of a variable comes in, I would retrieve its histogram from the corresponding file and assess the bin that it "falls in". Something like this:

# Not very elegant
for bin in histogram1.index:
    if 1.0232545 in bin:
        print("It's in!")
        print(histogram1.loc[bin])

Thanks!

Dr. Paprika

1 Answer


You are addressing two different topics here:

  1. What is an efficient way to store multiple series?
  2. How to determine the bin for a float from an already formed IntervalIndex?

The first part is straightforward. I would use pandas.concat() to build one big frame before saving, either to csv or, faster, to parquet:

pd.concat(histograms, keys=hist_names, names=['hist_name','bin']).rename('random_variable').to_frame().to_parquet('histograms.parquet')

see .to_parquet(), this answer, and this benchmark for more.
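
For example, here is a minimal end-to-end sketch of that write path (the data, the histogram names, and the histograms.parquet file name are illustrative assumptions; parquet I/O needs pyarrow installed):

import numpy as np
import pandas as pd

# Illustrative data: two variables with 500 realizations each
rng = np.random.default_rng(0)
histograms = []
for data in (rng.normal(10, 8, 500), rng.normal(5, 3, 500)):
    counts, edges = np.histogram(data, bins=6)
    # normalized frequencies indexed by a plain (non-categorical) IntervalIndex
    histograms.append(pd.Series(counts / counts.sum(),
                                index=pd.IntervalIndex.from_breaks(edges)))
hist_names = ['hist1', 'hist2']

df = (pd.concat(histograms, keys=hist_names, names=['hist_name', 'bin'])
        .rename('random_variable')   # parquet needs string column names, not 0
        .to_frame())
df.to_parquet('histograms.parquet')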

Then when reading back, select a single histogram with

hist1 = df.loc['hist1', 'random_variable']  # partial indexing drops the 'hist_name' level

or

grouped = df.reset_index('hist_name').groupby('hist_name')
hist1 = grouped.get_group('hist1')['random_variable']  # keep just the values column
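
Either way, in context (continuing the sketch above, with its assumed file name):

df = pd.read_parquet('histograms.parquet')
hist1 = df.loc['hist1', 'random_variable']
print(hist1.index.dtype)  # an interval dtype; the bins survive the round trip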

The second part is already answered here. In short, you need to flatten the IntervalIndex by taking the bins' right edges:

bins = hist1.index.right

Then you can find the bin for your value (or list of values) with numpy.digitize:

import numpy as np

# right=True matches pd.cut's (left, right] bins
i = np.digitize(my_value, bins, right=True)
return_value = hist1.iloc[i]
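
For example, a batch of new realizations (made-up values) can be binned in one call:

new_values = np.array([1.0232545, 7.5, 20.0])
i = np.digitize(new_values, bins, right=True)
probs = hist1.iloc[i]  # the stored frequency for each new value's bin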

Edit

Just found this answer about Indexing with an IntervalIndex, which also works:

return_value = hist1.loc[my_value]
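
Note that .loc on an IntervalIndex also accepts a list of values, so the batch case from above works here too:

return_values = hist1.loc[[1.0232545, 7.5, 20.0]]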
RichieV
  • Thanks for your well documented answer! Unfortunately I can't make the first part (`concat`) work. The closest I get is by storing the histograms in a dict after resetting their indexes: `dict = {'histo1':hist1.reset_index(), 'histo2':hist2.reset_index()}` and using concat on the dict: `pd.concat(dict, ignore_index=True).set_index('index')`. This leaves me with one dataframe, but without the individual names: I can't use your solution to read them back individually. I also can't use `to_parquet()` on it, since `parquet must have string column names`. Would you have any clue why? – Dr. Paprika Jul 26 '20 at 14:39
  • The second part, `return_value = hist1.loc[my_value]`, is an elegant one-liner, thanks! – Dr. Paprika Jul 26 '20 at 14:41
  • You need to use the `keys` and `names` arguments of `concat` so the index in the concatenated df will have two levels (hist_name and interval)... also notice the `rename` after concat, which changes the column name from `0` to a `str` – RichieV Jul 26 '20 at 14:55
  • That's what I did at first, but it gave me an error: `categories must match existing categories when appending` – Dr. Paprika Jul 26 '20 at 15:33
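
A likely explanation for that last error: histograms built with pd.cut carry a CategoricalIndex, and concat refuses to combine categoricals whose categories (the bin intervals) differ between histograms. One possible workaround, sketched here under the assumption that each index holds pandas Interval objects, is to rebuild every index as a plain IntervalIndex before concatenating:

# cast each categorical bin index to a plain IntervalIndex (assumed workaround)
for hist in histograms:
    hist.index = pd.IntervalIndex(list(hist.index))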