1

I have some number N of inputs and the corresponding outputs of a function for these input values.

What I'd like to do is estimate the mathematical function for the given inputs and outputs. Is it closer to the form of N, N^2, log(N), N*log(N), or 2^N?

Essentially, what I'd like to do is estimate big O. Thus, N is the amount of input data, and the output is compute time. So basically I'd like to at least know which of the above-mentioned functions my function is closest to.

Apsed1976
  • 31
  • 8
  • Start here: [A plain-English explanation of Big O notation](http://stackoverflow.com/questions/487258/what-is-a-plain-english-explanation-of-big-o-notation) – azurefrog Sep 29 '16 at 18:33
  • 1
    Just do a least-squares fitting of the unknown factor and see how well it fits. Of course, this will not account for the *for all n > n0* part in Big O. – Nico Schertler Sep 29 '16 at 18:35
  • 1
    Please choose one language tag or remove them all. – Saeid Sep 29 '16 at 18:44
  • @NicoSchertler Could you kindly provide with more details. Thanks – Apsed1976 Sep 29 '16 at 19:48
  • almost duplicate to [Finding the mathematical algorithm to which matches an input and output together](http://stackoverflow.com/q/31466518/2521214) – Spektre Sep 30 '16 at 07:17

2 Answers

2

You can use the Least Squares method to find the function that has the least distance to your data.

Suppose you have a list of sample observations of your unknown function in the shape of ordered pairs (x, y), i.e. (x, f(x)). You can measure the distance of this unknown function to some known function g using the Least Squares method:

distance = 0
for x,y in sample pairs
    distance += ( y - g(x) )^2

The smaller this distance is, the closer your unknown function is to the known function g.

Now, if you want to find the closest function (from a pre-determined list of functions) to your unknown function, you just have to calculate each of those functions' distance to your unknown function. Whichever has the least distance is the most similar to your unknown function.

Note that this method is an approximation and isn't 100% accurate, but you can increase its accuracy by providing a larger and more comprehensive sample of data.


Here is a sample Python implementation:

import math

functions = []

functions.append( lambda n: n )                # y = n
functions.append( lambda n: n*n )              # y = n^2
functions.append( lambda n: math.log(n,2) )    # y = log(n)
functions.append( lambda n: n*math.log(n,2) )  # y = n*log(n)
functions.append( lambda n: 2**n )             # y = 2^n

# creating the sample data; the unknown function is n + 10*n*log(n)
pairs = [ (n, n + 10*n*math.log(n,2) ) for n in range(1,200,1) ]

# calculating the distance of each function to the sample data
# and add them to a list
distances = [ ]
for func in functions:
    d = 0
    for n,y in pairs:
        d += (func(n) - y)**2 
    distances.append(d)

# finding the minimum value and printing its index
print(distances.index(min(distances)))

Output

3

It means the 4th function is closest to our sample data which is n*log(n).

Note that if we reduce the size of the sample data like this (getting rid of half the sample data):

pairs = [ (n, n + 10*n*math.log(n,2) ) for n in range(1,100,1) ]

This program will print 1, which means that the closest function is n^2. This shows the importance of the sample data.

Saeid
  • 4,147
  • 7
  • 27
  • 43
  • Could you kindly explain in more detail last two lines? I experience hard time understanding them. – Apsed1976 Sep 29 '16 at 20:02
  • It means the bigger your dataset is, the more accurate your approximation will be. I think having over 1000 sample points would be good enough. – Saeid Sep 29 '16 at 20:04
  • I don't understand what is happening there. Especially in the second to last line. Could you kindly explain? – Apsed1976 Sep 29 '16 at 20:09
  • @sudomakeinstall2 I posted a small addendum as a separate answer. Would you like to test if my assumption of increased robustness is correct? I do not have a Python installation at hand... – Nico Schertler Sep 29 '16 at 21:35
  • @NicoSchertler Yes, your modification indeed improved the robustness. upvoted. – Saeid Sep 30 '16 at 07:09
2

Here is a small addendum to sudomakeinstall2's answer.

In big O notation, you do not care about constant scaling. Hence, instead of measuring the distance ( y - g(x) )^2 as in his answer, you actually want to measure ( y - k * g(x) )^2, where k is the constant scaling factor that fits best. This k can be calculated directly as a least squares fit: minimizing the sum of ( y - k * g(x) )^2 over k (set the derivative with respect to k to zero) gives k = sum( g(x) * y ) / sum( g(x)^2 ). Here is the modified version, which should be more robust:

...
for func in functions:
    #calculate the best k
    numerator = 0
    denominator = 0
    for n,y in pairs:
        numerator += func(n) * y
        denominator += func(n) * func(n)
    k = numerator / denominator
    d = 0
    for n,y in pairs:
        d += (k * func(n) - y)**2 
    distances.append(d)
...
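For reference, here is a self-contained sketch of the whole modified program (variable names are mine), run against the smaller sample from the other answer — the one that fooled the unscaled version into answering n^2:

```python
import math

# candidate complexity classes (same list as in the other answer)
functions = [
    lambda n: n,                   # n
    lambda n: n * n,               # n^2
    lambda n: math.log(n, 2),      # log(n)
    lambda n: n * math.log(n, 2),  # n*log(n)
    lambda n: 2 ** n,              # 2^n
]

# the smaller sample: unknown function is n + 10*n*log2(n), n = 1..99
pairs = [(n, n + 10 * n * math.log(n, 2)) for n in range(1, 100)]

distances = []
for g in functions:
    # best scaling k minimizing sum((y - k*g(n))^2): k = sum(g(n)*y) / sum(g(n)^2)
    k = sum(g(n) * y for n, y in pairs) / sum(g(n) ** 2 for n, y in pairs)
    distances.append(sum((k * g(n) - y) ** 2 for n, y in pairs))

print(distances.index(min(distances)))  # prints 3, i.e. n*log(n)
```

With the fitted scaling, even the half-size sample identifies n*log(n) correctly.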
Nico Schertler
  • 32,049
  • 4
  • 39
  • 70