2

Is it possible to keep the dtype of the object that the apply function is called on? As I understand it, apply changes the dtype.

Please see the following MWE. This result is not what I want to achieve.

import pandas as pd
ds_a = pd.Series([True,False,True])
ds_b = ds_a.apply(lambda x: ~x)
print(ds_a.dtype == ds_b.dtype)
print(ds_b.dtype)

results in:

False
int64

ds_b should have the same dtype (boolean) as ds_a. I am interested in how to prevent any change of data type.

EDIT: Here is a better MWE for my use-case.

Please see the following (new) MWE.

import numpy as np
import pandas as pd
ds_a = pd.Series([True,False,True,True,True,False])
ds_mask = pd.Series([True,False])
func = lambda x: np.all(x==ds_mask)
ds_b = ds_a.rolling(len(ds_mask)).apply(func, raw=True)
print(func(ds_a[:2]).dtype)
print(ds_b.dtype)

results in:

dtype('bool')
float64
James Mchugh
TimK
  • https://stackoverflow.com/questions/52078594/pandas-apply-changing-dtype This question does not answer mine – TimK Nov 14 '19 at 14:13
  • Are you intentionally doing a bitwise complement instead of a logical `not`? The bitwise complement treats the booleans as the integers 0 and 1 and takes the bitwise complement of those values, so `~False = -1` and `~True = -2`. This is causing them to be cast to `int64`. – James Mchugh Nov 14 '19 at 14:26
  • In the above case, casting the values back to `bool` as suggested below would result in all values of `ds_b` being `True`, since the complement `~` of `True` and `False`, or `1` and `0`, is always non-zero. – James Mchugh Nov 14 '19 at 14:37
  • @JamesMchugh you are right. I have to check if this applies to my real problem. – TimK Nov 14 '19 at 14:39
  • @JamesMchugh I made a better MWE – TimK Nov 14 '19 at 15:12
  • I have updated my answer. – James Mchugh Nov 14 '19 at 15:52

2 Answers

3

So the issue is not necessarily that the Series is casting the values. The issue is that the bitwise complement operator ~ is being used instead of the logical not operator. Because bool is a subclass of int in Python, ~ treats True and False as the integers 1 and 0, resulting in the following:

~True = -2
~False = -1

This is why the output Series ds_b has a dtype of int64. Changing the code to the following should resolve the issue.

import pandas as pd


ds_a = pd.Series([True,False,True])
ds_b = ds_a.apply(lambda x: not x)
print(ds_a.dtype == ds_b.dtype)
print(ds_b.dtype)
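
As a quick illustration of the difference, here is a minimal sketch (the names scalar_results, ds_inverted and ds_not are only for this example). On scalars, ~ is the integer complement, whereas applied to the whole boolean Series it is an element-wise logical inversion that keeps the bool dtype, so for this particular case apply may not be needed at all.

```python
import pandas as pd

ds_a = pd.Series([True, False, True])

# bool is a subclass of int, so on scalars ~ acts on the integers 1 and 0.
scalar_results = (~int(True), ~int(False))
print(scalar_results)  # (-2, -1)

# Applied to the boolean Series itself, ~ is an element-wise logical inversion.
ds_inverted = ~ds_a
print(ds_inverted.dtype)  # bool

# apply with the logical not operator should give the same values.
ds_not = ds_a.apply(lambda x: not x)
print(ds_not.tolist())
```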

However, you are correct that the apply method will adjust the dtype of the result based on the values the function returns. In your case, the Python int results were stored as int64. If you come across this behavior in the future and it is undesired, consider the following code.

ds_b = ds_a.apply(lambda x: ~x, convert_dtype=False).astype(ds_a.dtype)

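This prevents apply from doing automatic conversions, and at the end it converts the dtype from object back to the original type. Note that with ~x specifically, the cast back to bool makes every value True, because -1 and -2 are both non-zero, so the pattern is only useful with a function whose results are meaningful in the original dtype. Also note that convert_dtype is deprecated as of pandas 2.1, where the documented replacement is to call astype(object) on the Series before apply. Here are some timings for comparison; the cast does not introduce a significant amount of overhead.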

In [26]: %timeit ds_b = ds_a.apply(lambda x: ~x)                                
257 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [27]: %timeit ds_b = ds_a.apply(lambda x: ~x).astype(ds_a.dtype)             
394 µs ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [28]: %timeit ds_b = ds_a.apply(lambda x: ~x, convert_dtype=False).astype(ds_a.dtype)
359 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
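
To show the cast-back pattern on a case where the result is meaningful in the original dtype, here is a minimal sketch with a hypothetical int8 Series (ds_small and the doubling function are assumptions for illustration and are not part of the question). It leaves out convert_dtype and simply casts the result back.

```python
import pandas as pd

# Hypothetical small-integer Series; values are chosen so doubling still fits in int8.
ds_small = pd.Series([1, 2, 3], dtype="int8")

# apply re-infers the result dtype from the values the function returns.
doubled = ds_small.apply(lambda x: x * 2)
print(doubled.dtype)

# Cast back to the dtype of the input Series.
restored = doubled.astype(ds_small.dtype)
print(restored.dtype)  # int8
```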

In your latest example, the Rolling object automatically converts the data to float64. This is a limitation of rolling rather than of Series or DataFrame apply. As it stands, there is no way to change the dtype of rolling operations within Pandas other than casting the result at the end. To do that, use the casting approach above, but omit the convert_dtype parameter, since the Rolling object's apply method does not accept it.
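
A minimal sketch of casting a rolling result back to bool, assuming the data from your second MWE (the names rolled and mask are only for this example). The leading NaN values have to be handled before the cast, because NaN would otherwise turn into True.

```python
import numpy as np
import pandas as pd

ds_a = pd.Series([True, False, True, True, True, False])
mask = np.array([True, False])

# Rolling.apply passes float64 windows to the function and returns float64 results.
rolled = ds_a.rolling(len(mask)).apply(
    lambda w: float(np.all(w == mask)), raw=True
)
print(rolled.dtype)  # float64

# The first len(mask) - 1 entries are NaN; drop them before casting to bool.
ds_b = rolled.dropna().astype(bool)
print(ds_b.tolist())  # [True, False, False, False, True]
```

Whether to drop the incomplete windows or fill them with False depends on whether the result has to stay aligned with the original index.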

If you are open to using packages other than Pandas, a rolling function can be implemented using numpy. See the following code:

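import numpy as np

def rolling_window(a, window):
    # Build a strided view of shape (..., n - window + 1, window).
    # No data is copied, so the result should not be written to.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = np.array([ True, False,  True,  True,  True, False])
mask = np.array([True, False])

# Compare every window with the mask; the result keeps the bool dtype.
b = (rolling_window(a, 2) == mask).all(axis=1, keepdims=True)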

After execution, b is equal to the expected output for your second MWE, except that it is a NumPy array.

array([[ True],
       [False],
       [False],
       [False],
       [ True]])
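
If NumPy 1.20 or later is available, the same windows can be built without hand-written strides. The following is a minimal sketch using numpy.lib.stride_tricks.sliding_window_view; it is an alternative to the rolling_window helper and not part of the original approach, and the name windows is only for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([True, False, True, True, True, False])
mask = np.array([True, False])

# Each row of the view is one window of len(mask) consecutive elements.
windows = sliding_window_view(a, len(mask))

# Compare each window with the mask; the result is a 1-D bool array.
b = (windows == mask).all(axis=1)
print(b)        # [ True False False False  True]
print(b.dtype)  # bool
```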
James Mchugh
  • Although I find the keyword in the [doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply), I'm getting a `TypeError` when using rolling and apply, because of the **unexpected keyword** `convert_dtype` – TimK Nov 14 '19 at 15:11
  • Thank you for the timings. It does not seem really important, but it somehow bugs me what I'm doing wrong – TimK Nov 14 '19 at 15:12
  • @TimK That is because with the new code you added, you are no longer performing `apply` on a `Series`, you are invoking it on a `window` object. Refer to [these docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.window.Rolling.apply.html#pandas.core.window.Rolling.apply) for that. The `convert_dtype` parameter does not apply for that method. – James Mchugh Nov 14 '19 at 15:15
  • @TimK Did the answer work for you? If not, how could it be adjusted to better answer your question? If it did, would you mind accepting it? Not only does this allow you to give back to the answerers for taking the time to answer your question, but it also serves to better the SO community by showing users the correct resolution for the issue you faced. This can be a great asset for other users facing the same issue. – James Mchugh Nov 25 '19 at 12:49
  • Thank you very much for your informative answer. I would like to accept your answer. But as far as I see it, the answer is: Pandas does not yet support rolling windows of dtypes other than float, therefore my approach is not possible. If you clarify this a bit more, I'll accept it. A better solution for me in this case is using only numpy; there I can do what I want without dropping my boolean dtype (especially in large datasets, ...) – TimK Nov 25 '19 at 22:23
  • @TimK While your original question was about preventing the `apply` method from changing the `dtype` of a Pandas Series, I improved my answer to clarify that it is not possible to do that with Pandas Rolling instances. I also added an implementation of rolling windows in Numpy which preserves the dtype, based on [here](https://rigtorp.se/2011/01/01/rolling-statistics-numpy.html) and [here](https://stackoverflow.com/questions/6811183/rolling-window-for-1d-arrays-in-numpy). – James Mchugh Nov 26 '19 at 02:56
1

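Just add an explicit conversion to boolean in the lambda you are applying. This restores the bool dtype, but as noted in the comments on the question, bool(~x) is True for every element, because ~True and ~False are both non-zero.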

import pandas as pd


ds_a = pd.Series([True,False,True])
ds_b = ds_a.apply(lambda x: bool(~x))
print(ds_a.dtype == ds_b.dtype)
print(ds_b.dtype)
Max Voitko
  • Does this result in two datatype casts? bool --> int64? --> bool | I know how to get my result in the desired dtype, but I would like to know how to prevent the conversion from happening. I don't understand the reason why apply has to change the data type – TimK Nov 14 '19 at 14:19
  • I believe @James Mchugh described the reason for the conversion: you have used *bitwise complement operator `~`* instead of *logical `not`*. – Max Voitko Nov 18 '19 at 11:08