
I am trying to use pd.cut to sort values into specific buckets. This works for most of my data, but a subset of rows is assigned nan even though the number clearly falls inside one of the bins. Here is an example df:

    numbers         difference_interval
0   0.000000e+00    nan
1   3.263739e-03    nan
2   3.637279e-02    nan
3   5.308298e-03    nan
4   -1.139971e-01   nan
5   nan             nan

Here is the code I used to create the intervals:

bins = pd.IntervalIndex.from_tuples([(-1, -.2), (-.2, -.1), (-.1, -.05), (-.05, 0), (0, .05), (0.05, .1), (0.1, .2), (0.2, 1)])
col = 'numbers'

df = (df.dropna(subset=col)
.assign(difference_interval= lambda df: pd.cut(df[col].values, bins).sort_values().astype(str)))

df.query('difference_interval == "nan"')

Why would this be happening?

geds133
  • I see values for `difference_interval` after running your code. – not_speshal Aug 01 '23 at 13:51
  • I believe you have some outliers or corner cases in `numbers` for which the code produces `nan`. But because you sorted the produced sequence of bins for no reason, the output no longer matches the original data, so you see `nan` in the wrong records. – Vitalizzare Aug 01 '23 at 14:43
  • Sort values moves things row-wise though, right? It won't move a nan value into a different row. Maybe I'm misunderstanding – geds133 Aug 01 '23 at 15:18
  • Why are you using `.astype(str)`? Is it possible that `numbers` are also strings? That'd cause the problem, and it'd explain why nan is written `nan` instead of `NaN`. Please make a [mre], including desired output and undesired (current) output. For specifics see [How to make good reproducible pandas examples](/q/20109391/4518341). – wjandrea Aug 01 '23 at 15:24
  • @geds133 put intentionally `-1` in the middle of your data to see `nan` at the very end of the output. – Vitalizzare Aug 01 '23 at 15:25
  • @wjandrea As mentioned I can't reproduce the issue even though I have tried. I'm happy to close – geds133 Aug 01 '23 at 15:59

1 Answer


There are three strange things:

  • Are you sure you want to sort the bins? Sorting breaks the mapping between the numbers and their assigned bins (the code below does not sort).

  • Do you expect dropna to do anything? In this toy(?) example it doesn't.

  • Also, your assignment is rather complicated. The code below is a simplified version of yours.
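To make the first point concrete, here is a minimal sketch of how sorting the result of `pd.cut` can detach the bins from their rows. The two bins and the three numbers are made up for illustration (they are not the OP's data); `-5.0` is deliberately outside every bin.

```python
import pandas as pd

# Hypothetical, simplified bins and data (not the OP's actual values).
bins = pd.IntervalIndex.from_tuples([(-1, 0), (0, 1)])
s = pd.Series([0.5, -5.0, -0.5], name="numbers")  # -5.0 is outside every bin

# pd.cut on a NumPy array returns a Categorical in the same order as the input.
aligned = pd.cut(s.values, bins)

# Sorting reorders the Categorical by bin and pushes NaN to the end,
# so position i no longer corresponds to row i of the data.
sorted_cats = aligned.sort_values()

print(pd.DataFrame({
    "numbers": s,
    "bin_aligned": aligned.astype(str),
    "bin_sorted": sorted_cats.astype(str),
}))
```

In the aligned column the missing bin sits next to `-5.0`, where it belongs. In the sorted column it lands on the last row, next to `-0.5`, which is a perfectly valid number. That would look like "nan where there is a clear value".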


import numpy as np
import pandas as pd

bins = pd.IntervalIndex.from_tuples([
        (-1, -.2),
        (-.2, -.1),
        (-.1, -.05),
        (-.05, 0),
        (0, .05),
        (0.05, .1),
        (0.1, .2),
        (0.2, 1)
])

col = 'numbers'

df = pd.DataFrame(
    columns=["numbers", "difference_interval"],
    data=[
        [0.000000e+00  ,  np.nan],
        [3.263739e-03  ,  np.nan],
        [3.637279e-02  ,  np.nan],
        [5.308298e-03  ,  np.nan],
        [-1.139971e-01 ,  np.nan]
    ]
)

df["bins"] = pd.cut(df[col].values, bins).astype(str)


The output is as expected:

   numbers    difference_interval  bins
0   0.000000  NaN                  (-0.05, 0.0]
1   0.003264  NaN                  (0.0, 0.05]
2   0.036373  NaN                  (0.0, 0.05]
3   0.005308  NaN                  (0.0, 0.05]
4  -0.113997  NaN                  (-0.2, -0.1]
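
If the real dataset still shows unexpected missing bins, one way to narrow it down might be to list the numbers that are present but did not land in any bin. This is only a sketch: the data below is hypothetical, and it assumes `numbers` is a numeric column (worth checking with `df["numbers"].dtype`, as raised in the comments). Note that the intervals are open on the left, so `-1.0` itself is not covered by `(-1, -0.2]`.

```python
import numpy as np
import pandas as pd

bins = pd.IntervalIndex.from_tuples([
    (-1, -.2), (-.2, -.1), (-.1, -.05), (-.05, 0),
    (0, .05), (0.05, .1), (0.1, .2), (0.2, 1),
])

# Hypothetical data: -1.0 sits on the open left edge, 1.5 is above every bin.
df = pd.DataFrame({"numbers": [0.0, 0.0033, -1.0, 1.5, np.nan]})

# Passing the Series (not .values) keeps the index, so bins stay aligned with rows.
cut = pd.cut(df["numbers"], bins)

# Rows that have a number but no bin: these are the corner cases to inspect.
unbinned = df.loc[cut.isna() & df["numbers"].notna(), "numbers"]
print(unbinned)
```

Any value that shows up in `unbinned` lies outside the bins or exactly on the lowest edge. If nothing shows up, the missing bins more likely come from misalignment, such as the sorting step.
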
Klops
  • IMO that's a reasonable remark, but not an answer to OP – Vitalizzare Aug 01 '23 at 14:38
  • you are right, I guess... this was my best guess to solve a non-reproducible problem :-/ – Klops Aug 01 '23 at 15:15
  • I have updated the answer to add a nan value. This also explains why I have used .assign as I wanted to method chain here and if I simply use `df`, then the initial method of `dropna` has not actually been applied to `df` hence the lambda function. – geds133 Aug 01 '23 at 15:16
  • If it doesn't reproduce then I guess let's close it - not sure how to reproduce without the actual data – geds133 Aug 01 '23 at 15:16
  • In my dataset, these are nan no matter what method I use – geds133 Aug 01 '23 at 15:17