
There is a very expensive computation I must make frequently.

The computation takes a small array of numbers (about 20 entries) that sums to 1 (i.e., a histogram) and outputs something that I can store pretty easily.

I have 2 things going for me:

  1. I can accept approximate answers
  2. The "answers" change slowly. For example: [.1 .1 .8 0] and [.1 .1 .75 .05] will yield similar results.

Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.

To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.

I can only afford to store about 80 to 100 precomputed (histogram, computation result) pairs in my look-up table.

So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
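For concreteness, here is a minimal sketch (Python/NumPy, with made-up table data) of the run-time lookup step I have in mind. For 1-D histograms on the same equally spaced bins and with equal total mass, the Earth-Mover-Distance reduces to the L1 distance between the cumulative sums (up to the bin width):

```python
import numpy as np

def emd_1d(h1, h2):
    # For two histograms on the same equally spaced bins with equal total mass,
    # EMD is the L1 distance between their cumulative sums (up to the bin width).
    return np.abs(np.cumsum(h1) - np.cumsum(h2)).sum()

def lookup(table_hists, table_results, query):
    # Return the precomputed result whose histogram is EMD-closest to `query`.
    dists = [emd_1d(h, query) for h in table_hists]
    return table_results[int(np.argmin(dists))]

# Toy 80-entry table of 20-bin histograms (placeholder data, not real results).
rng = np.random.default_rng(0)
table_hists = rng.random((80, 20))
table_hists /= table_hists.sum(axis=1, keepdims=True)   # normalize each row to sum 1
table_results = [f"result_{i}" for i in range(80)]       # stand-ins for the expensive outputs

query = rng.random(20)
query /= query.sum()
print(lookup(table_hists, table_results, query))
```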

Ivan
  • are there any histogram shapes that can be definitely excluded, or are very unlikely? – seth Jul 25 '13 at 02:14
  • could you also, if possible, give an analogous computation to perform on the histograms, so I can measure if they're "close enough"? – seth Jul 25 '13 at 02:28

3 Answers


Finding N points in M-space that form an optimally spread-out set is more or less equivalent to hypersphere packing (1,2), and in general answers are not known for M > 10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.

It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
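As a hedged illustration of the PCA-plus-regression idea (a sketch, not a prescribed procedure), here is what it might look like in Python with scikit-learn; the histogram pool and the target values below are placeholders for the real data and the real expensive computation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Offline: generate many normalized 20-bin histograms and run the expensive
# computation on each (both are placeholders here).
rng = np.random.default_rng(1)
hists = rng.random((5000, 20))
hists /= hists.sum(axis=1, keepdims=True)
targets = hists @ rng.random(20)            # stand-in for the real expensive result

# Reduce each histogram to a few principal-component coordinates...
pca = PCA(n_components=5)
coords = pca.fit_transform(hists)

# ...and fit a compact model that predicts the target directly from those coordinates.
model = LinearRegression().fit(coords, targets)

# Online: no table lookup at all -- project the incoming histogram and predict.
query = rng.random(20)
query /= query.sum()
estimate = model.predict(pca.transform(query.reshape(1, -1)))
print(estimate)
```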

James Waldby - jwpat7
  • I had a similar thought, but using k-means clustering. – Drew Hall Jul 25 '13 at 02:53
  • I second the idea of a neural net in this situation. There isn't enough information to say for sure, but this is basically the textbook/most straightforward application of the entire world of ANN techniques, and you wouldn't have to store any histograms in memory. – Justin L. Jul 25 '13 at 04:47
  • @jwpat7, the suggestion to try PCA feels like it's on the right path. However, I'm not sure if PCA applies here (I have used PCA to reduce 3-color images to greyscale, so I have used it before). Are you suggesting to find, say, 5 basis vectors and then "cover" the space by creating 80-100 combinations of these 5 basis vectors? – Ivan Jul 25 '13 at 17:09
  • @DrewHall, this sounds promising. I'm assuming you are suggesting to build a bunch of random histograms, and then cluster them into 80-100 clusters. Then the centroids of each cluster should be pretty well spread out? – Ivan Jul 25 '13 at 17:11
  • @Ivan, yes, find a few good basis vectors; but no, don't use them to cover the space. Instead use them to directly predict your target numbers. Same idea with neural net: Develop a compact model on a development system by exposing a net to the vectors and scores of a much larger data set than you can support in the embedded environment. In the embedded environment, directly predict the numbers you need via that compact model. – James Waldby - jwpat7 Jul 25 '13 at 17:50
  • @DrewHall, please post your clustering idea as an answer. I'll check it. – Ivan Jul 26 '13 at 13:14
  • @Ivan: Done! Hope it is useful. – Drew Hall Jul 26 '13 at 22:22

I second jwpat7's answer, but my very naive approach was to treat the count in each histogram bin as a y value, treat the x values as 20 evenly spaced steps over 0..1, and then obtain parameters a, b, c that describe y as a cubic function of x.

To get a "covering" of the histograms I just iterated through "possible" values for each parameter.

e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.

[figure: the 27 histograms generated from the cubic model]

Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.

For a more comprehensive covering, like 100, you would have to choose the number of values per parameter more carefully. 100**(1/3) isn't an integer, so the simple num_covers**(1/num_params) split wouldn't work, but for 3 parameters 4*5*5 would.
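A minimal sketch of generating such a covering from the cubic model, assuming the parameters are swept over linspace(-1, 1, 3) as above and that each curve is clipped at zero and renormalized so it is a valid histogram (that last step is my assumption, not spelled out above):

```python
import itertools
import numpy as np

x = np.linspace(0, 1, 20)                  # bin positions for a 20-bin histogram
param_values = np.linspace(-1, 1, 3)       # 3 values per parameter -> 3**3 = 27 combos

covering = []
for a, b, c in itertools.product(param_values, repeat=3):
    y = a * x + b * x**2 + c * x**3        # cubic shape, no constant term (see edit 2)
    y = np.clip(y, 0, None)                # histogram entries can't be negative
    if y.sum() > 1e-9:                     # skip combinations that clip to all zeros
        covering.append(y / y.sum())       # renormalize so each histogram sums to 1

covering = np.array(covering)
print(covering.shape)                      # (<=27, 20); degenerate combos are dropped
```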

Since the actual values of the parameters could vary greatly and still achieve the same shape, it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parameters b/a and b/c.

Here is an 81 histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3):

[figure: the 81 histograms generated from the quartic model]

Edit: Since you said your histograms are described by arrays of ~20 elements, I figured fitting the parameters would be very fast.

Edit 2: On second thought, I think using a constant term in the model is pointless; all that matters is the shape.
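To make the fitting step concrete, here is a sketch (Python/NumPy, with my own helper names) of fitting the constant-free cubic to a 20-bin histogram by least squares and forming the b/a, b/c ratios mentioned above:

```python
import numpy as np

def fit_shape_params(hist):
    # Least-squares fit of y ~ a*x + b*x**2 + c*x**3 (no constant term).
    x = np.linspace(0, 1, len(hist))
    design = np.column_stack([x, x**2, x**3])
    coeffs, *_ = np.linalg.lstsq(design, hist, rcond=None)
    return coeffs                           # array [a, b, c]

def shape_ratios(hist, eps=1e-12):
    # Scale-free descriptors (b/a, b/c); eps is a crude guard against division by ~0.
    a, b, c = fit_shape_params(hist)
    return b / (a + eps), b / (c + eps)

hist = np.array([0.1, 0.1, 0.75, 0.05] + [0.0] * 16)   # toy 20-bin input
print(shape_ratios(hist))
```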

seth
  • I'd been thinking about something very similar. The logic to this is undeniable. – Ivan Jul 25 '13 at 17:12
  • I think this approach can be combined well with Drew's k-means suggestion. Instead of generating just 81-100 histograms with the parameter approach, generate thousands and cluster them into 80-100 groups. Within those thousands you could composite some models to induce more variation, some feature that wouldn't normally arise from a polynomial. I will write some code and tell you how it goes. – seth Jul 25 '13 at 17:55

Building on @jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.

The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
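A minimal sketch of this clustering step (Python with scikit-learn; the random pool, the cluster count of 90, and the renormalization are my assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Generate a large pool of (hopefully representative) random 20-bin histograms.
rng = np.random.default_rng(2)
pool = rng.random((10000, 20))
pool /= pool.sum(axis=1, keepdims=True)

# Cluster the pool into as many groups as the table can hold; the centroids become
# the histograms for which the expensive result is precomputed.
kmeans = KMeans(n_clusters=90, n_init=10, random_state=0).fit(pool)
exemplars = kmeans.cluster_centers_
exemplars /= exemplars.sum(axis=1, keepdims=True)   # renormalize centroids to sum 1

# At run time, the nearest exemplar can be found with kmeans.predict(query.reshape(1, -1))
# or, to stay faithful to the question, with the EMD-based lookup instead.
```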

Drew Hall
  • To find random histograms I'll use the method proposed by Matti Virkkunen in http://stackoverflow.com/questions/2640053/getting-n-random-numbers-that-the-sum-is-m. Then I'll "exaggerate" a random histogram by raising each entry to a power, and then renormalizing. If the power is "large" the exaggerated histograms will contain n-1 0's and exactly one 1. If the power is very small the exaggerated histogram will contain n entries of 1/n. – Ivan Jul 27 '13 at 18:29
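The generate-then-exaggerate scheme described in the comment above could be sketched like this (Python/NumPy; the function names and example powers are mine):

```python
import numpy as np

def random_histogram(n_bins, rng):
    # Random histogram summing to 1 via sorted cut points (the linked method).
    cuts = np.sort(rng.random(n_bins - 1))
    return np.diff(np.concatenate(([0.0], cuts, [1.0])))

def exaggerate(hist, power):
    # Raise entries to a power and renormalize: large powers approach a single spike,
    # powers near 0 approach the uniform histogram.
    h = hist ** power
    return h / h.sum()

rng = np.random.default_rng(3)
h = random_histogram(20, rng)
print(exaggerate(h, 3.0))    # peakier version
print(exaggerate(h, 0.1))    # flatter version
```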