
I have a DataFrame with two columns. Plotting them as a scatter plot gives something like the following picture:

Scatterplot

I would like to know if there is a way to find the distribution curve that best fits it, since the tutorials I've found focus on the distribution of one variable only (e.g. this case). I'm looking for something like this:

What I'm looking for

Does anyone have any directions or sample code for this case?

Code Different
Eduardo Sousa

1 Answer


You can try fitting polynomials of different degrees using numpy.polyfit. It takes x, y, and the degree of the fitting polynomial as inputs.

You can write a loop that iterates over the degrees, say from 1 to 5, and plot f(x) for each degree using the coefficients the function returns.

for d in degrees:

  • Fit using np.polyfit(x, y, d)

  • Get coefficients and optionally plot f(x) for degree d

  • Then compute the sum of squared residuals, Σ (yi - f(xi))^2
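
The loop above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the x and y arrays here are synthetic stand-ins (an assumption), and the names degrees and sse_by_degree are made up for the example. In practice you would take x and y from your two DataFrame columns.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration only).
# With a real DataFrame you would use something like:
#   x = df["col_x"].to_numpy(); y = df["col_y"].to_numpy()
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = 0.5 * x**2 - 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)

degrees = range(1, 6)   # try polynomial degrees 1 through 5
sse_by_degree = {}      # degree -> sum of squared residuals

for d in degrees:
    coeffs = np.polyfit(x, y, d)   # least-squares coefficients, highest power first
    f = np.poly1d(coeffs)          # callable polynomial f(x)
    residuals = y - f(x)
    sse_by_degree[d] = float(np.sum(residuals**2))
    print(f"degree {d}: SSE = {sse_by_degree[d]:.2f}")
```

To inspect the fits visually, you can overlay f(x) on the scatter plot for each degree (for example with matplotlib's plt.scatter(x, y) and plt.plot(x, f(x))).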

Note that the sum of squares is only an indication of the error. For a least-squares fit it never increases as the degree goes up, so a lower value alone does not mean a better model. Plotting the fitted curves will show whether you are overfitting the data.

This is just one of the ways to go about solving the problem.

v_a
  • This can be a really good first step, but I'm aiming for something like: "This is a normal distribution or this is a Weibull distribution". – Eduardo Sousa Aug 21 '19 at 02:45
  • @EduardoSousa I think these distributions are generally meant for univariate data. If you are plotting something on the x axis and the y axis and trying to fit a line through it, you are trying to find the f(x) that best fits the given y. – v_a Aug 21 '19 at 02:48