
I'm trying to build a fuzzy set from a series of example values in Python 3.

For instance, given [6, 7, 8, 9, 27] I'd like to obtain a function that (where "5ca" means circa 5, i.e. some point close to 5):

  • returns 0.0 from 0 to 5ca,
  • goes gradually up to 1.0 from 5ca to 6,
  • stays at 1.0 from 6 to 9,
  • goes gradually down to 0.0 from 9 to 10ca,
  • stays at 0.0 from 10ca to 26ca,
  • goes gradually up to 1.0 from 26ca to 27,
  • goes gradually down to 0.0 from 27 to 28ca,
  • returns 0.0 from 28ca and afterwards.

Notice that the y values are always in the range [0.0, 1.0], and if a value is missing from the series, its y is 0.0.
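To make the target concrete, the shape described by the bullets above can be hand-built as a piecewise-linear membership function with np.interp (the breakpoints are my reading of the bullets, taking each "ca" as one unit away from the value; this is just the desired output spelled out for [6, 7, 8, 9, 27], not a general solution):

```python
import numpy as np

# Breakpoints read off the bullet list for [6, 7, 8, 9, 27]:
# 0 up to ~5, ramp to 1 at 6, plateau to 9, back to 0 by ~10,
# 0 until ~26, peak of 1 at 27, back to 0 by ~28.
xp = [0.0, 5.0, 6.0, 9.0, 10.0, 26.0, 27.0, 28.0]
fp = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]

def f(x):
    # outside the breakpoints the membership stays at 0
    return np.interp(x, xp, fp, left=0.0, right=0.0)

print(f(4.0))   # -> 0.0
print(f(6.5))   # -> 1.0
print(f(5.8))   # ~0.8, close to the 0.85 mentioned further down
```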

Please consider that in the most general case the input values might be something like [9, 41, 20, 13, 11, 12, 14, 40, 4, 4, 4, 3, 34, 22] (the values can always be sorted, but notice that in this series the value 4 is repeated 3 times, so I'd expect it to get a membership of 1.0 and all the other values a lower one -- not necessarily 1/3 as in this case).

Desired result

The top part of this picture shows the desired function plotted up to x=16 (hand drawn). I'd be more than happy to obtain anything like it. The bottom part of the picture shows some extra features that would be nice to have but are not strictly mandatory:

  • better smoothing than shown in my drawing (A),
  • cumulative effect (B) provided that...
  • the function never goes above 1 (C) and...
  • the function never goes below 0 (D).

I've tried several approaches adapted from polyfit, Bézier curves, Gaussians and others, but the results weren't what I expected. I've also tried the fuzzpy package, but I couldn't make it work because of its dependency on epydoc, which doesn't seem to be compatible with Python 3. No luck with statsmodels either.

Can anyone suggest how to achieve the desired function? Thanks in advance.

In case you're wondering, I plan to use the resulting function to predict the likelihood of a given value; with respect to the fuzzy set described above, for instance, 4.0 returns 0.0, 6.5 returns 1.0 and 5.8 something like 0.85. Maybe there is another, simpler way to do this?


This is how I usually process the input values (I'm not sure if the part that adds the 0s is needed); what should I put in place of ??? to compute the desired f?

import numpy as np
import matplotlib.pyplot as plt

def prepare(values, normalize=True):
    highest = 0  # avoid shadowing the built-in max()
    table = {}
    for value in values:
        table[value] = table.get(value, 0) + 1
        if normalize and table[value] > highest:
            highest = table[value]

    if normalize:
        for value in table:
            table[value] /= float(highest)

    # add explicit 0s for the missing integer values (and one past the max)
    for value in range(sorted(table)[-1] + 2):
        if value not in table:
            table[value] = 0

    x = sorted(table)
    y = [table[value] for value in x]
    return x, y

if __name__ == '__main__':
    # get x and y vectors
    x, y = prepare([9, 41, 20, 13, 11, 12, 14, 40, 4, 4, 4, 3, 34, 22], normalize=True)

    # calculate fitting function
    f = ???

    # calculate new x's and y's
    x_new = np.linspace(x[0], x[-1], 50)
    y_new = f(x_new)

    # plot the results
    plt.plot(x, y, 'o', x_new, y_new)
    plt.xlim([x[0] - 1, x[-1] + 1])
    plt.show()

    print("Done.")

A practical example, just to clarify the motivation for this... The series of values might be the number of minutes after which people give up standing in line in front of a kiosk. With such a model, we could try to predict how likely somebody is to leave the queue given how long they have been waiting. The value read in this way can then be defuzzified, for instance, into happily waiting [0.00, 0.33], just waiting (0.33, 0.66] and about to leave (0.66, 1.00]. In the about to leave case, that somebody could be engaged by something (an ad?) to convince them to stay.
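The three bands described above can be sketched as a simple threshold lookup (the labels and boundaries are taken from the paragraph; bisect is just one convenient way to pick the band):

```python
from bisect import bisect_left

# Bands from the paragraph: [0.00, 0.33] -> happily waiting,
# (0.33, 0.66] -> just waiting, (0.66, 1.00] -> about to leave.
def defuzzify(membership):
    labels = ["happily waiting", "just waiting", "about to leave"]
    # bisect_left keeps the boundary values in the lower band
    return labels[bisect_left([0.33, 0.66], membership)]

print(defuzzify(0.20))  # -> happily waiting
print(defuzzify(0.50))  # -> just waiting
print(defuzzify(0.85))  # -> about to leave
```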

Stefano Bragaglia
  • what does "ca" mean? and it's not clear to me what the function you're looking for should output for your [9, 41, 20, 13 ,11, 12, 14, 40, 4, 4, 4, 3, 34, 22] example. Could you clarify? – Ryan Stout Jul 26 '17 at 17:54
  • ca means circa (something close to the given value) – Stefano Bragaglia Jul 26 '17 at 18:02
  • what should the output of the function be for [9, 41, 20, 13 ,11, 12, 14, 40, 4, 4, 4, 3, 34, 22] ? – Ryan Stout Jul 26 '17 at 18:04
  • the function for the longer example would be 0.0 from 0 to 2 or so, then go up to 1.0 between 2 or so and 3, go down to 1/3 somewhere between 3 and 4, then go to 0.0 and stay at 0.0 till just before 9, etc. (if I could draw the resulting function easily, I would have included the picture... sorry about that). – Stefano Bragaglia Jul 26 '17 at 18:06
  • 1
    What's the reasoning for why it's 1/3 between 3 and 4? – Ryan Stout Jul 26 '17 at 18:22
  • That's just an example: 4 is the most common value, whose frequency is 3. All the other values in the example have frequency 1, which is 1/3 of the highest frequency. – Stefano Bragaglia Jul 26 '17 at 22:52
  • @RyanStout just added a hand made drawing... I hope it makes the question clearer... – Stefano Bragaglia Jul 27 '17 at 10:35

2 Answers

import numpy as np

def pulse(x):
    # triangular pulse: 1 at x == 0, falling linearly to 0 at |x| == 1
    return np.maximum(0, 1 - abs(x))

def fuzzy_in_unscaled(x, xs):
    return pulse(np.subtract.outer(x, xs)).sum(axis=-1)

def fuzzy_in(x, xs):
    largest = fuzzy_in_unscaled(xs, xs).max()
    return fuzzy_in_unscaled(x, xs) / largest

>>> fuzzy_in(1.5, [1, 3, 4, 5])  # single membership
0.5
>>> fuzzy_in([[1.5, 3], [3.5, 10]], [1, 3, 4, 5])  # vectorized in the first argument
array([[0.5, 1], [1, 0]])

This exploits the fact that the peak values must lie on the elements. This is not true for all pulse functions.

You'd do well to precompute largest, as computing it is O(N^2).
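To address the "little main" request in the comments, here is one way the answer's functions might be driven with the series from the question (the variable names mirror the question's snippet; the plotting lines are left out so it runs headless):

```python
import numpy as np

def pulse(x):
    # triangular pulse: 1 at x == 0, falling linearly to 0 at |x| == 1
    return np.maximum(0, 1 - abs(x))

def fuzzy_in_unscaled(x, xs):
    # sum the pulses centred on each observed value
    return pulse(np.subtract.outer(x, xs)).sum(axis=-1)

def fuzzy_in(x, xs):
    # scale so the most frequent value has membership exactly 1
    largest = fuzzy_in_unscaled(xs, xs).max()
    return fuzzy_in_unscaled(x, xs) / largest

# xs is the series of observed values; x is the point (or array) to test
xs = [9, 41, 20, 13, 11, 12, 14, 40, 4, 4, 4, 3, 34, 22]
x_new = np.linspace(min(xs) - 1, max(xs) + 1, 200)
y_new = fuzzy_in(x_new, xs)

print(fuzzy_in(4, xs))   # 4 occurs 3 times (the mode) -> 1.0
print(fuzzy_in(25, xs))  # more than 1 away from any observation -> 0.0
```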

Eric
  • Thanks for your answer! Just to be sure I got it right: xs is my series of values and x is the value I want to test? What is largest? The largest x or the largest y? Can you possibly add a little main? TIA! – Stefano Bragaglia Jul 27 '17 at 09:21
  • xs is indeed the set of values to look for membership in. This is vectorized in x, so you can test for membership of a whole array at once. Largest is the arbitrary scale factor needed to make the outcome never exceed 1 – Eric Jul 27 '17 at 09:34
  • x, y = prepare([9, 41, 20, 13, 11, 12, 14, 40, 4, 4, 4, 3, 34, 22], normalize=True) returns all the x and the respective y already normalised to 1, so dividing by largest is no longer needed? I still don't grasp how to call your function. See the updated snippet in my question... – Stefano Bragaglia Jul 27 '17 at 10:30
  • Is there a way to make the base of the pulse larger? Thanks again! – Stefano Bragaglia Jul 27 '17 at 11:18
  • `np.maximum(0, 1 - abs(x / 2))` would do that – Eric Jul 27 '17 at 11:19
  • brilliant, thanks! Do you think it is possible to get the same result by combining several gauss curves by means of NoisyOR (p_1 + p_2 - p_1 * p_2)? – Stefano Bragaglia Jul 27 '17 at 12:33
  • Only if you can define `NoisyOR` for more than two inputs. Is NoisyOR associative (`NO(NO(x, y), z) == NO(x, NO(y, z))`)? Also, gaussian `pulse`s will break the assumption I mention in the answer, so computing `largest` will be harder – Eric Jul 27 '17 at 13:07
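Regarding the NoisyOR question in the comments: the two-input form p1 + p2 - p1*p2 equals 1 - (1 - p1)(1 - p2), which generalizes associatively to n inputs as 1 - prod(1 - p_i). A quick check of both claims (this is plain probability algebra, not tied to any library):

```python
def noisy_or2(p, q):
    # two-input NoisyOR from the comment above
    return p + q - p * q

def noisy_or(ps):
    # n-input form: 1 - prod(1 - p_i); equal to folding noisy_or2 in any order
    prod = 1.0
    for p in ps:
        prod *= 1.0 - p
    return 1.0 - prod

x, y, z = 0.2, 0.5, 0.9
left = noisy_or2(noisy_or2(x, y), z)
right = noisy_or2(x, noisy_or2(y, z))
print(abs(left - right) < 1e-12)               # associative
print(abs(noisy_or([x, y, z]) - left) < 1e-12) # same as the n-input form
```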

This only works (due to np.bincount) with a set of integers.

import numpy as np

def fuzzy_interp(x, vals):
    vmn, vmx = np.amin(vals), np.amax(vals)
    v = vals - vmn + 1
    b = np.bincount(v, minlength=vmx - vmn + 2)
    b = b / np.amax(b)
    # query at x - vmn + 1 so that x == val lands on val's own bin
    return np.interp(x - vmn + 1, np.arange(b.size), b, left=0, right=0)
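For completeness, a self-contained run with the series from the question (the interp query is written as x - vmn + 1 rather than x - vmn - 1, which lines the peaks up with the observed values and avoids the one-bin shift reported in the comment below):

```python
import numpy as np

def fuzzy_interp(x, vals):
    vmn, vmx = np.amin(vals), np.amax(vals)
    v = vals - vmn + 1                                  # shift so min maps to bin 1
    b = np.bincount(v, minlength=vmx - vmn + 2)         # integer histogram with a 0 bin
    b = b / np.amax(b)                                  # scale the mode to 1.0
    return np.interp(x - vmn + 1, np.arange(b.size), b, left=0, right=0)

vals = np.array([9, 41, 20, 13, 11, 12, 14, 40, 4, 4, 4, 3, 34, 22])
print(fuzzy_interp(4, vals))    # 4 is the mode -> 1.0
print(fuzzy_interp(3.5, vals))  # halfway between 3 (1/3) and 4 (1.0) -> ~2/3
```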
Daniel F
  • Thanks for your answer! Just tried out your solution... I think that vmin in line 3 should be replaced by vmn but anyway the result seems to be shifted right by 1... – Stefano Bragaglia Jul 27 '17 at 10:57