
tl;dr

I am working with Python in SQL Server 2017. The Python code is wrapped in a stored procedure to which I pass a query. The query is evaluated and its result is passed to Python. If a string column (char, nchar, varchar, nvarchar) in the query contains NULL, it gets mapped to None in Python. But if an int column contains NULL, it gets mapped to -2147483648 (presumably the minimum 32-bit integer value).

My question is: how do I get a NULL value from an int column to be None in Python, rather than -2147483648? The column needs to stay int.

Reproducible example

Test data I'm working with:

CREATE TABLE [dbo].[test_table](
    [a-string] [nvarchar](50) NULL,
    [a-date] [date] NULL,
    [a-int] [int] NULL,
    [a-null-int] [int] NULL,
    [a-null-str] [nvarchar](50) NULL
) ON [PRIMARY]
GO
INSERT [dbo].[test_table] ([a-string], [a-date], [a-int], [a-null-int], [a-null-str]) VALUES (N'asdf', CAST(N'2018-04-11' AS Date), 1, NULL, NULL)
INSERT [dbo].[test_table] ([a-string], [a-date], [a-int], [a-null-int], [a-null-str]) VALUES (N'fdsa', CAST(N'2008-04-11' AS Date), 2, NULL, NULL)
INSERT [dbo].[test_table] ([a-string], [a-date], [a-int], [a-null-int], [a-null-str]) VALUES (N'Bob "Bla" Bob', CAST(N'2028-04-11' AS Date), 3, NULL, NULL)
INSERT [dbo].[test_table] ([a-string], [a-date], [a-int], [a-null-int], [a-null-str]) VALUES (N'Bob, Bob', CAST(N'2038-04-11' AS Date), 4, NULL, NULL)
INSERT [dbo].[test_table] ([a-string], [a-date], [a-int], [a-null-int], [a-null-str]) VALUES (N'Bob bob', CAST(N'1998-04-11' AS Date), 5, 1, NULL)

The last two columns contain NULL values. The first is of type int, the second of type nvarchar.

The code of the stored proc:

CREATE PROCEDURE [dbo].[usp_test]
    @query NVARCHAR(max)
AS
BEGIN
EXEC sp_execute_external_script 
@language = N'Python', 
@script = N'
print(InputDataSet)
',
@input_data_1 = @query
END;

The stored procedure takes one parameter, a query, whose result set is passed to the Python code. The Python code simply prints the data.

How I execute the stored procedure:

EXEC [dbo].[usp_test] N'SELECT [a-string],CAST([a-date] as nvarchar(20)) as [a-date],[a-int],[a-null-int],[a-null-str] FROM [dbo].[test_table]'

The result is:

        a-string      a-date  a-int  a-null-int a-null-str
0           asdf  2018-04-11      1 -2147483648       None
1           fdsa  2008-04-11      2 -2147483648       None
2  Bob "Bla" Bob  2028-04-11      3 -2147483648       None
3       Bob, Bob  2038-04-11      4 -2147483648       None
4        Bob bob  1998-04-11      5           1       None

The unexpected behaviour is in the column a-null-int. How do I get it to be None rather than -2147483648, while staying int?
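The value -2147483648 in the output above is the smallest 32-bit signed integer. A quick check with NumPy (run outside SQL Server, so this says nothing about why the transfer layer picks that value) confirms it:

```python
import numpy as np

# Smallest value representable by a 32-bit signed integer.
int32_min = np.iinfo(np.int32).min
print(int32_min)  # prints -2147483648
```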

This question relates strongly to SQL Server. According to this documentation by Microsoft, either BxlServer or SQL Satellite (I am not sure which) handles data transfer between SQL Server and Python. I expect the problem to be in one of those services, but I have no idea how to circumvent it.

Research done:

Edit 1: Is this question a duplicate of the question "How to store empty value as an Integerfield"?

In my opinion, no. The problem there seems to have been a discrepancy between data types (str vs. int). That is not the case here. If I check the data type, I get:

print(type(InputDataSet.loc[0, "a-null-int"]))
>>> <class 'numpy.int32'>

This is correct. I am passing an int column in, and it gets mapped to a NumPy int32. But what I need is for the NULL to become None.

Edit 2: Response to @arun-gurunathan's answer:

Before I begin, I need to state that the [a-null-int] column needs to stay integer typed. For context, I need to export the data to a CSV. To illustrate my problem, I changed the value of the last row in the [a-null-int] column from NULL to 1, and changed the beginning of the question accordingly.

RxMissingValues.int32() returns the value used to substitute NULL values, which is -2147483648. I can replace these values with numpy.nan. It is not a bulletproof fix, because a column in SQL Server could legitimately contain exactly this value. Nevertheless, I continued down this path...

The following code I put in the stored procedure above:

import numpy
from revoscalepy import RxMissingValues
InputDataSet.loc[InputDataSet["a-null-int"] == RxMissingValues.int32(), "a-null-int"] = numpy.nan
print(InputDataSet)

This is what I've got (abbreviated):

   a-null-int
0         NaN
1         NaN
2         NaN
3         NaN
4         1.0

The [a-null-int] column gets converted to float. This behavior is documented in the pandas docs and has been discussed on Stack Overflow.
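The float conversion can be reproduced with pandas alone. The sketch below uses a hypothetical three-row int32 Series rather than InputDataSet, and masks one value so that pandas has to store NaN in the result:

```python
import pandas as pd

# An int32 column, similar to what SQL Server hands to Python.
ints = pd.Series([1, 2, 3], dtype="int32")

# Masking out one value puts NaN in its place; a NumPy integer
# array cannot hold NaN, so pandas returns a float column instead.
masked = ints.where(ints != 1)

print(ints.dtype)
print(masked.dtype)
```

The second dtype printed is a float type, which is why the 1 in the last row shows up as 1.0 above.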

I expect that my problem is not solvable due to NumPy's limitations in handling NA values. I'll wait a while longer to see whether more answers come in on how I can keep the type of column a-null-int as int, or on some workaround. Otherwise I will accept @arun-gurunathan's answer.
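One possible workaround for the CSV export, sketched here under assumptions and not tested inside SQL Server: convert the column to object dtype so it can hold Python ints and None side by side. The column is then no longer a NumPy int column, but the written CSV contains integers and empty fields rather than floats. `INT32_MISSING` is assumed to equal `RxMissingValues.int32()`, and `input_data` and `sentinel_to_none` are hypothetical names standing in for InputDataSet and a helper. A genuine -2147483648 in the data would still be treated as NULL.

```python
import numpy as np
import pandas as pd

# Assumption: same value that RxMissingValues.int32() returns inside SQL Server.
INT32_MISSING = np.iinfo(np.int32).min

# Hypothetical stand-in for InputDataSet, using values from the test table.
input_data = pd.DataFrame({
    "a-int": np.array([1, 2, 3, 4, 5], dtype="int32"),
    "a-null-int": np.array([INT32_MISSING] * 4 + [1], dtype="int32"),
})


def sentinel_to_none(column, sentinel=INT32_MISSING):
    """Return an object-dtype copy where the sentinel is replaced by None."""
    values = [None if v == sentinel else int(v) for v in column]
    return pd.Series(values, index=column.index, dtype=object)


export = input_data.copy()
export["a-null-int"] = sentinel_to_none(export["a-null-int"])

# None is treated as missing, so it is written as an empty field.
csv_text = export.to_csv(index=False)
print(csv_text)
```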

Simon Lang
  • Possible duplicate of [How to store empty value as an Integerfield](https://stackoverflow.com/questions/20399717/how-to-store-empty-value-as-an-integerfield) – Jacob H Apr 19 '18 at 12:48
  • @JacobH I don't think so. Added an explanation to the end of my question. – Simon Lang Apr 19 '18 at 13:02

1 Answer


The RxMissingValues documentation describes the pandas/NumPy limitation on storing None values in integer columns. You can handle these by checking for the missing value (RxMissingValues.int32()), as described in that document.
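A minimal sketch of such a check, assuming the sentinel equals the int32 minimum reported in the question (inside SQL Server it would come from `RxMissingValues.int32()`). The DataFrame `df` is a hypothetical stand-in for InputDataSet. Keeping the NULL information in a separate boolean mask leaves the column's int32 dtype untouched:

```python
import numpy as np
import pandas as pd

# Assumption: equal to RxMissingValues.int32() from revoscalepy.
INT32_MISSING = np.iinfo(np.int32).min

# Hypothetical stand-in for InputDataSet.
df = pd.DataFrame({
    "a-null-int": np.array([INT32_MISSING, INT32_MISSING, 1], dtype="int32"),
})

# Boolean mask marking the rows that held NULL in SQL Server.
is_null = df["a-null-int"] == INT32_MISSING

# Only the rows with real values; the original column is not modified.
valid_values = df.loc[~is_null, "a-null-int"]

print(is_null.tolist())
print(df["a-null-int"].dtype)
```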

Arun Gurunathan