
I would like to know whether there is a way to exclude one or more data regions in a polynomial fit. Currently this doesn't work as I would expect. Here is a small example:

import numpy as np
import pandas as pd
import zfit

# Create test data
left_data = np.random.uniform(0, 3, size=1000).tolist()
mid_data = np.random.uniform(3, 6, size=5000).tolist()
right_data = np.random.uniform(6, 9, size=1000).tolist()
testsample = pd.DataFrame(left_data + mid_data + right_data, columns=["x"])

# Define fit parameter
coeff1 = zfit.Parameter('coeff1', 0.1, -3, 3)
coeff2 = zfit.Parameter('coeff2', 0.1, -3, 3)

# Define Space for the fit
obs_all = zfit.Space("x", limits=(0, 9))

# Perform the fit
bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
new_testsample = zfit.Data.from_pandas(obs=obs_all, df=testsample.query("x<3 or x>6"), weights=None)
nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)
minimizer = zfit.minimize.Minuit()
result = minimizer.minimize(nll)

[Figure: TestSample.png]

Here I've created a small test sample from three uniform distributions. I only want to use the data with x < 3 OR x > 6 and ignore the 'peak' in between. Because the two outer regions have the same shape and height, I'd expect coeff1 and coeff2 to be (nearly) zero and the fitted curve to be a straight, horizontal line. Obviously this doesn't happen, because zfit assumes that there are simply no entries between 3 and 6.

I also tried using a MultiSpace to ignore that region:

limit1 = zfit.Space("x", limits=(0, 3))
limit2 = zfit.Space("x", limits=(6, 9))
obs_data = limit1 + limit2

But this leads to the following error:

ValueError: obs need to be a Space with exactly one limit if rescaling is requested.

Does anyone have an idea how to solve this?

Thanks in advance ^^

lweid
  • Are you trying to fit both LOW and HIGH simultaneously/jointly? If `zfit` accepts weights, you could include MID but with infinite/large error bars i.e. very low/zero weights. Otherwise you can find another library whose fitting will take weights. Perhaps discussion in https://github.com/zfit/zfit/issues/193 helps – jtlz2 Jan 13 '22 at 09:48
  • Have you looked at degeneracies (or covariances) between your fitted parameters? Also, you could impute/overwrite the MID region with noise that looks like the wings. Not ideal, but it is one option. – jtlz2 Jan 13 '22 at 09:57
  • Are you wedded to `zfit` or are you open to other options? – jtlz2 Jan 13 '22 at 10:21
  • I don't want to use a simultaneous fit here (if that's what you meant). I want to perform a single fit, using the data in the LOW and HIGH region. But I'll try the 'MID region with infinite error bars' solution and inform you how it performed ^^ – lweid Jan 13 '22 at 10:31
  • I mean to say (and they are the same thing): a single (= simultaneous = joint) fit of your two-parameter model over the (LOW and HIGH) data. Try the weights and let us know how you get on :) – jtlz2 Jan 13 '22 at 10:32
  • In this example I could add noise in the MID region easily, yes. But at the end I want to perform a fit over data with an unknown shape. There I would have no idea how to shape/generate the data in that region and would have to guess a model for that (and I would add an unwanted bias there as well). Currently I'm trying to do everything with zfit if possible. If I'm not able to do what I described above with it, I'm open for other options :) – lweid Jan 13 '22 at 10:32
  • @jtlz2 zfit does indeed support weights, so technically you could use zero weights, but it's anyway not the problem: you can just use a smaller data set to "remove" the points, the difficulty is with the normalization range and that won't change anything with the weights. I think there are not many libraries anyway (do you know of any?) which allow to set an individual normalization range – Mayou36 Jan 13 '22 at 10:51

1 Answer


Indeed, this is a bit of a tricky problem, but one that may just need a small update in zfit.

What you are doing is correct: simply use only the data in the desired region. However, this is not the whole story, because there is also a "normalization range": probabilistically speaking, it acts like conditioning on a certain region, since we know the data can only lie in that region. Hence the normalization of the PDF should integrate only over the included (LOW and HIGH) regions.
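To make the role of the normalization range concrete, here is a minimal plain-Python sketch (it does not use zfit; the helper names `total_length` and `flat_density` are made up for this illustration). It compares a flat PDF normalized over the full range 0 to 9 with the same flat PDF normalized only over the two included regions.

```python
# Plain-Python illustration (no zfit): a flat shape normalized over the
# full range versus over the included regions only.
# All names below are hypothetical helpers for this sketch.

def total_length(regions):
    """Sum of the lengths of disjoint (low, high) intervals."""
    return sum(high - low for low, high in regions)

def flat_density(x, regions):
    """Density of a uniform PDF normalized over `regions`; zero outside."""
    inside = any(low <= x <= high for low, high in regions)
    return 1.0 / total_length(regions) if inside else 0.0

full_range = [(0.0, 9.0)]
sidebands = [(0.0, 3.0), (6.0, 9.0)]

print(flat_density(1.0, full_range))  # 1/9: normalized over 0..9
print(flat_density(1.0, sidebands))   # 1/6: normalized over LOW and HIGH only
print(flat_density(4.5, sidebands))   # 0.0: the excluded region carries no probability
```

When the PDF is normalized over the full range, it still assigns probability to the region between 3 and 6, so a fit to the sideband data has to "explain" the missing entries there, which it does by bending the polynomial.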

This can normally be done in two ways:

Using multispace

You can use the multispace property, as you do. This should work (though it is most probably not the way to go in the future), except for a quirk in the polynomial functions: the polynomials are defined from -1 to 1. Currently, the data is therefore simply rescaled to lie within -1 and 1 (and for that, the "space" property of the PDF is used). At the moment, this requires the space to be a simple space with a single limit; in principle a multispace could also be allowed, using the minimum and maximum of its limits.

Simultaneous fit

As mentioned in the comments by @jtlz2, you can do a simultaneous fit. That is nothing to worry about: it simply splits the likelihood into two parts. As the likelihood is a product of probabilities, we can conceptually split it into two products and multiply them (or add their logs).

So you can have the pdf fit the lower region and the upper at the same time. However, this does not solve the problem of the normalization: what should the PDF be normalized to? We will run into the same problem.
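The splitting of the likelihood can be checked numerically with a small plain-Python sketch. The toy PDF and the data points below are invented for this illustration and are not taken from the question; the point is only that the negative log-likelihood of the combined sample equals the sum of the two partial ones.

```python
import math

def density(x):
    """Hypothetical toy PDF on [0, 9]: linear shape 1 + 0.1 * t, t in [-1, 1]."""
    t = (x - 4.5) / 4.5
    return (1.0 + 0.1 * t) / 9.0  # integrates to 1 over the full range [0, 9]

def nll(data):
    """Negative log-likelihood: minus the sum of the log-densities."""
    return -sum(math.log(density(x)) for x in data)

# Made-up points in the LOW and HIGH regions
low = [0.5, 1.2, 2.7]
high = [6.1, 7.4, 8.8]

joint = nll(low + high)        # one fit over all points
split = nll(low) + nll(high)   # "simultaneous" fit: two terms added
print(math.isclose(joint, split))
```

Note that the toy PDF here is still normalized over the full range, so this sketch only shows the equivalence of the two likelihood forms, not a fix for the normalization.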

Solution 1: different space and norm

The space and the normalization range are, however, not the same thing. By default, the space (usually called 'obs') is also used as the normalization range, but this is not required. So you could use a single space going from the lowest to the highest point as the obs and then set the normalization range to your multispace (set_norm should do it, or set_norm_range if you are not using the newest version). This, I think, should do the trick.
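The idea of separating the two ranges can be sketched without zfit. The following plain-Python toy is a simplification under stated assumptions: it keeps only a first-order coefficient `c1`, uses a crude grid scan instead of Minuit, and all names (`OBS`, `NORM`, `scaled`, `norm_integral`) are hypothetical. The full range is used only for the rescaling to [-1, 1], while the normalization integral runs over the two included regions.

```python
import math
import random

OBS = (0.0, 9.0)                  # full range, used only for the rescaling
NORM = [(0.0, 3.0), (6.0, 9.0)]   # regions the PDF is normalized over

def scaled(x):
    """Map x from OBS onto [-1, 1], where the polynomial is defined."""
    low, high = OBS
    return (2.0 * x - (low + high)) / (high - low)

def shape(x, c1):
    """Unnormalized first-order polynomial 1 + c1 * t."""
    return 1.0 + c1 * scaled(x)

def norm_integral(c1):
    """Analytic integral of shape() over the NORM regions only."""
    half_width = (OBS[1] - OBS[0]) / 2.0  # dx = half_width * dt
    total = 0.0
    for a, b in NORM:
        ta, tb = scaled(a), scaled(b)
        total += half_width * ((tb - ta) + 0.5 * c1 * (tb ** 2 - ta ** 2))
    return total

def nll(data, c1):
    """Unbinned negative log-likelihood with the sideband normalization."""
    norm = norm_integral(c1)
    return -sum(math.log(shape(x, c1) / norm) for x in data)

# Sideband-only toy data, mirroring the cut x < 3 or x > 6
random.seed(0)
data = [random.uniform(0, 3) for _ in range(1000)]
data += [random.uniform(6, 9) for _ in range(1000)]

# Crude grid scan over c1 (kept inside (-1, 1) so the shape stays positive)
candidates = [i / 100.0 for i in range(-90, 91)]
best_c1 = min(candidates, key=lambda c1: nll(data, c1))
print(best_c1)
```

With the normalization restricted to the sidebands, the scan should prefer a slope close to zero, which is the flat line expected in the question.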

Solution 2: manual re-scaling

The actual problem is the complaint that the re-scaling to -1 and 1 can't be done. Every polynomial that performs this re-scaling can also be told not to, using the apply_scaling=False argument. With that, you are responsible for scaling the data to lie within -1 and 1 (as the polynomials are not defined outside), and there should not be any error.
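A minimal sketch of such a manual re-scaling, in plain Python and assuming the full range 0 to 9 from the question (the function name `rescale_to_unit` is made up for this illustration):

```python
def rescale_to_unit(x, low, high):
    """Linearly map x from [low, high] onto [-1, 1]."""
    return (2.0 * x - (low + high)) / (high - low)

low, high = 0.0, 9.0
edges = [0.0, 3.0, 6.0, 9.0]

# Region edges in the rescaled coordinate: the LOW region becomes
# [-1, -1/3] and the HIGH region becomes [1/3, 1].
print([rescale_to_unit(x, low, high) for x in edges])
```

The same transformation has to be applied to the data and to the limits of the spaces, and the fitted curve is then expressed in the rescaled coordinate.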

Mayou36
  • Thank you very much, also for all this detailed information. Your suggested 'Solution 1' already solved my problem. I've nothing to add ^^ – lweid Jan 13 '22 at 12:56