pd.cut not bucketing values into intervals even though value is there

Question

I am trying to use `pd.cut` to assign values to specific buckets. This works for most of the data, but a subset of rows ends up as nan even though the value clearly falls inside one of the intervals. Here is an example df:

    numbers         difference_interval
0   0.000000e+00    nan
1   3.263739e-03    nan
2   3.637279e-02    nan
3   5.308298e-03    nan
4   -1.139971e-01   nan
5   nan             nan

Here is the code I used to create the intervals:

bins = pd.IntervalIndex.from_tuples([(-1, -.2), (-.2, -.1), (-.1, -.05), (-.05, 0), (0, .05), (0.05, .1), (0.1, .2), (0.2, 1)])
col = 'numbers'

df = (df.dropna(subset=col)
.assign(difference_interval= lambda df: pd.cut(df[col].values, bins).sort_values().astype(str)))

df.query('difference_interval == "nan"')

Why would this be happening?

I see values for `difference_interval` after running your code. — not_speshal, Aug 01 '23 at 13:51
I believe you have some outliers or corner cases in `numbers` for which the code produces `nan`. But because you sorted the resulting sequence of bins for no reason, the output no longer lines up with the original data. So you see `nan` in the wrong records. — Vitalizzare, Aug 01 '23 at 14:43
`sort_values` moves things row-wise though, right? It won't move a nan value into a different row. Maybe I'm misunderstanding — geds133, Aug 01 '23 at 15:18
Why are you using `.astype(str)`? Is it possible that `numbers` are also strings? That'd cause the problem, and it'd explain why nan is written `nan` instead of `NaN`. Please make a [mre], including desired output and undesired (current) output. For specifics see [How to make good reproducible pandas examples](/q/20109391/4518341). — wjandrea, Aug 01 '23 at 15:24
@geds133 put intentionally `-1` in the middle of your data to see `nan` at the very end of the output. — Vitalizzare, Aug 01 '23 at 15:25
@wjandrea As mentioned I can't reproduce the issue even though I have tried. I'm happy to close — geds133, Aug 01 '23 at 15:59
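
The point raised in the comments above can be checked with a small sketch. The data here is hypothetical (it is not the asker's dataset): `-1` is placed in the middle, as suggested, and it falls outside every interval because `IntervalIndex.from_tuples` builds right-closed intervals by default, so `(-1, -0.2]` excludes `-1` itself. Sorting the result of `pd.cut` then reorders the bins independently of the rows they came from.

```python
import pandas as pd

bins = pd.IntervalIndex.from_tuples(
    [(-1, -.2), (-.2, -.1), (-.1, -.05), (-.05, 0),
     (0, .05), (.05, .1), (.1, .2), (.2, 1)]
)

# Hypothetical data: -1 sits exactly on the open left edge of the first bin.
df = pd.DataFrame({"numbers": [0.03, -1.0, -0.11, 0.5]})

# Positional result: element i is the bin of numbers[i].
aligned = pd.cut(df["numbers"].values, bins)

# Sorting the Categorical orders the bins themselves and pushes NaN last,
# so element i no longer belongs to numbers[i].
shuffled = aligned.sort_values()

df["aligned"] = aligned
df["sorted"] = shuffled
print(df)
```

In the `aligned` column the missing bin sits next to `-1.0`, where it belongs. In the `sorted` column it has moved to the last row, next to `0.5`, which is what makes it look as if a perfectly valid value was not bucketed.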

score 0 · Answer 1 · answered Aug 01 '23 at 14:12

There are three strange things:

Are you sure you want to sort the bins? Sorting breaks the mapping between the numbers and their assigned bins (the code below does not sort).
Do you expect `dropna` to do anything? In this toy(?) example it doesn't.
Also, your assignment is rather complicated; the code below is a simplified version of yours.


import numpy as np
import pandas as pd

bins = pd.IntervalIndex.from_tuples([
        (-1, -.2),
        (-.2, -.1),
        (-.1, -.05),
        (-.05, 0),
        (0, .05),
        (0.05, .1),
        (0.1, .2),
        (0.2, 1)
])

col = 'numbers'

df = pd.DataFrame(
    columns=["numbers", "difference_interval"],
    data=[
        [0.000000e+00  ,  np.nan],
        [3.263739e-03  ,  np.nan],
        [3.637279e-02  ,  np.nan],
        [5.308298e-03  ,  np.nan],
        [-1.139971e-01 ,  np.nan]
    ]
)

df["bins"] = pd.cut(df[col].values, bins).astype(str)

The output is as expected:

    numbers     difference_interval  bins
0   0.000000    NaN                  (-0.05, 0.0]
1   0.003264    NaN                  (0.0, 0.05]
2   0.036373    NaN                  (0.0, 0.05]
3   0.005308    NaN                  (0.0, 0.05]
4   -0.113997   NaN                  (-0.2, -0.1]

Klops

IMO that's a reasonable remark, but not an answer to OP – Vitalizzare Aug 01 '23 at 14:38
you are right, I guess... this was my best guess to solve a non-reproducible problem :-/ – Klops Aug 01 '23 at 15:15
I have updated the answer to add a nan value. This also explains why I have used .assign as I wanted to method chain here and if I simply use `df`, then the initial method of `dropna` has not actually been applied to `df` hence the lambda function. – geds133 Aug 01 '23 at 15:16
If it doesn't reproduce then I guess let's close it - not sure how to reproduce without the actual data – geds133 Aug 01 '23 at 15:16
In my dataset, these are nan no matter what method I use – geds133 Aug 01 '23 at 15:17
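
For the method-chaining case described in these comments, here is a minimal sketch that keeps the `dropna` / `.assign` / lambda pattern but drops the sort. It assumes the column is numeric, and the data is made up for illustration. Passing the Series itself to `pd.cut` (rather than `.values`) returns a Series with the same index, so each bin stays attached to its row. The last step lists the rows that match no interval at all, which is one way to find the values that genuinely produce nan.

```python
import numpy as np
import pandas as pd

bins = pd.IntervalIndex.from_tuples(
    [(-1, -.2), (-.2, -.1), (-.1, -.05), (-.05, 0),
     (0, .05), (.05, .1), (.1, .2), (.2, 1)]
)
col = "numbers"

# Hypothetical data: one missing value and two values outside every bin.
df = pd.DataFrame({col: [0.0, 0.0033, -0.114, np.nan, 1.5, -1.0]})

result = (
    df.dropna(subset=[col])
      # The lambda receives the frame *after* dropna; no sort_values here.
      .assign(difference_interval=lambda d: pd.cut(d[col], bins))
)

# Rows that survived dropna but are not covered by any interval.
uncovered = result[result["difference_interval"].isna()]
print(uncovered)
```

With this data, `uncovered` contains `1.5` (above the last bin) and `-1.0` (on the open left edge of the first bin). If the real dataset shows nan for values that look like they are in range, running the same check on it should show exactly which values `pd.cut` could not place.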