I have a dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(10), columns=['v']).astype(float)

How can I make sure that the numbers in v are whole numbers? I am very concerned about rounding, truncation, and floating-point representation errors.

00__00__00
  • How will testing for integers allay concerns about floating-point errors? Do the values come from integers, and you are concerned they have changed? Or are they the results of calculations whose mathematical properties are such that exact results would be integers? – Eric Postpischil Mar 13 '18 at 10:21
  • These values come from integers. However, during processing they are often cast to float64. – 00__00__00 Mar 13 '18 at 12:18
  • The only errors that can occur in handling integers in floating point are rounding and overflow errors when converting from one format to another. When converting integer to floating-point, if the precision does not suffice to represent the value exactly, it will be rounded. However, the value it will be rounded to will be another integer, due to the nature of floating-point. Therefore, testing whether all values in an array are integers will provide no information about whether any rounding errors have occurred. – Eric Postpischil Mar 13 '18 at 12:51
  • If the task is to ensure that values converted from integer to floating-point do not incur any rounding error, then it suffices if no integer exceeds the precision of the significand of the floating-point format. For example, IEEE 754 basic 64-bit binary has a 53-bit significand, so conversion of any integer up to 2^53 in magnitude will not incur any rounding error. – Eric Postpischil Mar 13 '18 at 12:54
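Both points in these comments can be demonstrated directly in plain Python (a quick sketch, no pandas needed):

print(float(2**53) == 2**53)          # True: 2**53 fits exactly in a 53-bit significand
print(float(2**53 + 1) == 2**53)      # True: 2**53 + 1 is rounded to the neighbouring integer
print(float(2**53 + 1).is_integer())  # True: the rounded value is still a whole number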

5 Answers


Comparison with astype(int)

Tentatively convert your column to int and test with np.array_equal:

np.array_equal(df.v, df.v.astype(int))
True
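For contrast, a quick sketch of a case where this check fails (astype(int) truncates, so any fractional part shows up as a mismatch):

s = pd.Series([1.0, 2.5])
np.array_equal(s, s.astype(int))  # False: 2.5 truncates to 2
# Note: astype(int) raises on NaN values, so handle missing data first.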

float.is_integer

You can use this Python function in conjunction with apply:

df.v.apply(float.is_integer).all()
True

Or, using Python's all in a generator expression, for space efficiency:

all(x.is_integer() for x in df.v)
True
cs95
  • What is the tolerance of allclose compared to is_integer? Are they a call to the same function? – 00__00__00 Mar 13 '18 at 07:10
  • @ErroriSalvo No, the mechanisms are slightly different. With `allclose`, the tolerance is very small to account for floating-point inaccuracies. With `is_integer`, the function actually checks for whole numbers. The mechanisms differ, but the end result is the same. – cs95 Mar 13 '18 at 07:12
  • `allclose` is incapable of determining that a number is an integer unless the tolerance is set to 0, at which point it becomes a test for equality. Furthermore, as stated in my comment to the question, testing for integer values does not accomplish the OP’s actual goal. – Eric Postpischil Mar 13 '18 at 12:56
  • @EricPostpischil okay, I've changed that to array_equal. By the way, this may be an XY problem, but it is still useful to know how to do this with numpy/pandas, so I've gone ahead and answered anyway. I appreciate the criticism (and the downvote). – cs95 Mar 13 '18 at 13:17
  • `df.v.apply`: not sure if this works; after `df.v` it is a numpy ndarray, which does not have the method `apply`. Do you mean `apply_along_axis`? – Joe Mar 31 '20 at 05:43
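To make the allclose-versus-is_integer distinction from these comments concrete, a minimal sketch:

x = 3.0 + 1e-9                   # almost, but not exactly, a whole number
print(x.is_integer())            # False: is_integer demands exact wholeness
print(np.allclose(x, round(x)))  # True: the default tolerance absorbs the 1e-9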

Here's a simpler, and probably faster, approach:

(df[col] % 1 == 0).all()

To ignore nulls:

(df[col].fillna(-9999) % 1 == 0).all()
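Why this works, in a quick sketch: for a float x, x % 1 is its fractional part, so it is 0.0 exactly when x is whole. Note that NaN % 1 is NaN, and NaN == 0 is False, which is why nulls must be filled (here with a sentinel like -9999) before the check:

s = pd.Series([3.0, 4.5, np.nan])
print(s % 1)       # 0.0, 0.5, NaN
print(s % 1 == 0)  # True, False, False (NaN never compares equal)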
scott

If you want to check multiple float columns in your dataframe, you can do the following:

col_should_be_int = df.select_dtypes(include=['float']).applymap(float.is_integer).all()
float_to_int_cols = col_should_be_int[col_should_be_int].index
df.loc[:, float_to_int_cols] = df.loc[:, float_to_int_cols].astype(int)

Keep in mind that a float column containing only integers will not get selected if it has np.NaN values. To cast float columns with missing values to integer, you need to fill/remove the missing values, for example with median imputation:

float_cols = df.select_dtypes(include=['float'])
float_cols = float_cols.fillna(float_cols.median().round()) # median imputation
col_should_be_int = float_cols.applymap(float.is_integer).all()
float_to_int_cols = col_should_be_int[col_should_be_int].index
df.loc[:, float_to_int_cols] = float_cols[float_to_int_cols].astype(int)
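For illustration, a small hypothetical frame run through the first snippet ('a' holds whole-number floats, 'b' does not):

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [1.5, 2.0, 3.0]})
col_should_be_int = df.select_dtypes(include=['float']).applymap(float.is_integer).all()
print(col_should_be_int)
# a     True
# b    False
# dtype: bool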
mgoldwasser

For completeness, Pandas v1.0+ offers the convert_dtypes() utility, which (among other conversions) performs the requested operation for all dataframe columns (or series) containing only integer numbers.
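For example, applied to the whole dataframe (a minimal sketch; requires pandas >= 1.0):

df = pd.DataFrame({'v': [0.0, 1.0, 2.0]})
df = df.convert_dtypes()
print(df.dtypes)  # v    Int64  (pandas' nullable integer dtype)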

If you wanted to limit the conversion to a single column only, you could do the following:

>>> df.dtypes          # inspect previous dtypes
v                      float64

>>> df["v"] = df["v"].convert_dtype()
>>> df.dtypes          # inspect converted dtypes
v                      Int64
ankostis

On 27,331,625 rows this works well (time: 1.3 s):

df['is_float'] = df[field_fact_qty] != df[field_fact_qty].astype(int)

This way took 4.9 s:

df[field_fact_qty].apply(lambda x: x.is_integer())
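To reproduce such a comparison yourself, a minimal sketch (hypothetical column 'v'; absolute timings will vary by machine and data size):

import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'v': np.arange(1_000_000, dtype=float)})
print(timeit.timeit(lambda: (df['v'] != df['v'].astype(int)).any(), number=10))
print(timeit.timeit(lambda: df['v'].apply(lambda x: x.is_integer()).all(), number=10))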
Nicoolasens