
Background:

I ran into a problem executing code from a machine learning case. I've already worked around it with an ugly hack, so I am able to execute the notebook, but I still do not fully understand the cause.

The issue arises when I try to execute the following code, which is used to create dummy variables using OneHotEncoder from sklearn.

categorical_columns = ~np.in1d(train_X.dtypes, [int, float])

Although the code executes without any error, it fails to recognize numpy.int64 as an int datatype, so all int64 columns are classified as categorical and passed into the OneHotEncoder.

train_X is a pandas DataFrame with the following columns and datatypes. As you can see, the integers are stored as numpy.int64.
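A minimal reproduction of the check, as a sketch only. The small frame below is a hypothetical stand-in for train_X, and the list comprehension is an explicit rewrite of the membership test, assuming the builtin `int` is resolved to `np.dtype(int)` when it is compared with a dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_X with explicitly sized integer columns.
df = pd.DataFrame({
    "a_int64": np.array([1, 2, 3], dtype=np.int64),
    "b_int32": np.array([1, 2, 3], dtype=np.int32),
    "c_float": np.array([0.1, 0.2, 0.3], dtype=np.float64),
    "d_text": ["x", "y", "z"],
})

# The builtin `int` maps to the platform's default NumPy integer dtype.
default_int = np.dtype(int)

# Explicit form of "is this column's dtype int or float?"
matches = [dt == default_int or dt == np.dtype(float) for dt in df.dtypes]
categorical_columns = ~np.array(matches)

print("default integer dtype:", default_int)
print(dict(zip(df.columns, categorical_columns)))
```

Only one of the two integer columns matches `int` on any given machine; which one depends on what `np.dtype(int)` is there.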

dataframe

The code was originally written in Jupyter Notebook on a Mac, where it worked fine, and it also ran fine in Colaboratory on the Google cloud. Everyone else who tried running the code from Jupyter on their almost identical Windows machines had the same issue as I did.

The Problem:

It seems that on Windows machines, numpy.int64 is not linked to the native int datatype.

Things I've tried and verified

  1. Although dated and based on Python 2.7.x, this post made me believe it was a version issue, so I verified:
    • My machine is running a 64-bit version of Windows 10
    • Python is installed as 64-bit
    • Anaconda is also installed as 64-bit
    • I used a clean environment with just pandas, numpy, sklearn and dependencies, all updated to their latest versions
    • When I run python I get the following:

terminal

I noted the strange "on win32" here, but according to post 1 and post 2 it seems to be merely a product of the "infinite wisdom of Microsoft".

  2. I tried understanding the issue by reading 1, 2 and 3. I've managed to come up with several workarounds based on these, but I still do not understand why the code works on one system but not on another.

Question:

Why does numpy.int64 not translate into the native int datatype on Windows, even though everything is running 64-bit, when it does on Mac and other systems?

Jelle Hoekstra

1 Answer

I don't have an answer as to why the default int on 64-bit Windows is int32, but it is a very confusing fact:

np.dtype('int') returns dtype('int32') on 64 bit Windows and dtype('int64') on 64 bit Linux.

See also the second warning here and this numpy github issue.
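To see which case applies on a given machine, something like the following sketch can be run. It only reads values, and the use of `ctypes` to report the width of the C `long` is my own addition rather than part of the original notebook. As far as I know, NumPy 2.0 and later changed the default integer to 64 bits on 64-bit Windows, so the result also depends on the installed NumPy version:

```python
import ctypes
import numpy as np

# What the builtin `int` maps to in NumPy on this machine.
default_int = np.dtype(int)

# Width of the C `long` type for this Python build, in bits.
c_long_bits = ctypes.sizeof(ctypes.c_long) * 8

print("NumPy version:", np.__version__)
print("NumPy default integer:", default_int.name)
print("C long width in bits:", c_long_bits)
print("int64 equals builtin int:", np.dtype(np.int64) == default_int)
```

If the last line prints `False`, a check against `[int, float]` will not recognise int64 columns as integers.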

In your concrete case I'd use pandas' is_numeric_dtype function to check numeric-ness in a platform-independent and straightforward way:

from pandas.api.types import is_numeric_dtype
categorical_columns = ~train_X.dtypes.apply(is_numeric_dtype).to_numpy()
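As a self-contained illustration of the same idea, here is a sketch on a small made-up frame (the column names and values are hypothetical, not the asker's data). It also shows `select_dtypes`, an alternative pandas method, which gives the same split for this frame:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for train_X, mixing integer widths on purpose.
train_X = pd.DataFrame({
    "age_int64": np.array([31, 45, 27], dtype=np.int64),
    "count_int32": np.array([1, 0, 2], dtype=np.int32),
    "price": np.array([9.5, 12.0, 7.25], dtype=np.float64),
    "city": ["Utrecht", "Leiden", "Delft"],
})

# Boolean mask: True for every column that is not numeric.
categorical_columns = ~train_X.dtypes.apply(is_numeric_dtype).to_numpy()
categorical_names = train_X.columns[categorical_columns].tolist()

# Alternative: let pandas select the numeric columns directly.
numeric_names = train_X.select_dtypes(include="number").columns.tolist()

print("categorical:", categorical_names)
print("numeric:", numeric_names)
```

Both integer widths count as numeric here, so the result does not depend on what the builtin `int` maps to. One difference to be aware of: `is_numeric_dtype` treats boolean columns as numeric, while `select_dtypes(include="number")` does not.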
Stef
  • The default integer in numpy is closely tied to `long` type in C, [see docs](https://numpy.org/doc/1.17/user/basics.types.html). The C standard doesn't specify the size of a `long` exactly, only that it's at least 32 bits ([wikipedia link](https://en.wikipedia.org/wiki/C_data_types#Basic_types)). The actual size depends on the compiler and cpu architecture... where Windows x64 / MS Visual C++ compiler is kinda unique for not making a `long` 64 bits in size. – user7138814 Sep 07 '19 at 20:42
  • @user7138814: thanks for the comment. Do you know whether the Intel C++ compiler on Windows 64 handles longs differently? – Stef Sep 07 '19 at 20:58
  • The strange thing here is that ```long64``` does translate to native ```long``` as well as ```long32```. ```int32``` also translates fine into ```int``` – Jelle Hoekstra Sep 09 '19 at 11:24
  • @Stef thanks for the suggestion! It worked after importing the function as it is not imported with pandas by default ```import pandas.api.types as ptypes``` (thanks to this [post](https://stackoverflow.com/questions/28596493/asserting-columns-data-type-in-pandas)) – Jelle Hoekstra Sep 09 '19 at 11:37
  • yes, sorry - forgot to indicate the import, added it to the answer – Stef Sep 09 '19 at 11:39
  • @user7138814 is there a way to force it to be 64 bits, like a registry setting or something? Or will that cause other problems? At least then I could quickly check whether it is the real cause of this problem. – Jelle Hoekstra Sep 09 '19 at 11:44
  • @Jelle no you would need to compile numpy from the source code, and the same for all other libraries that build on numpy, like pandas. – user7138814 Sep 17 '19 at 13:11