tl;dr
I am working with Python in SQL Server 2017. The Python code is wrapped in a stored procedure to which I pass a query; the query gets evaluated and its result set is passed to Python. If a string column (char, nchar, varchar, nvarchar) in the query contains NULL, it is mapped to None in Python. But if an int column contains NULL, it is mapped to -2147483648 (I guess the minimum integer value).
My question is: how do I get a NULL value from an int column to arrive in Python as None rather than -2147483648? The column needs to stay int.
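A quick check in plain Python/NumPy (outside SQL Server, just to confirm my guess) shows that this substitution value is exactly the minimum of a signed 32-bit integer:

import numpy as np

# -2147483648 is the value that shows up in place of NULL in int columns
print(np.iinfo(np.int32).min)                    # -2147483648
print(np.iinfo(np.int32).min == -2147483648)     # True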
Reproducible example
Test data I'm working with:
CREATE TABLE [dbo].[test_table](
[a-string] [nvarchar](50) NULL,
[a-date] [date] NULL,
[a-int] [int] NULL,
[a-null-int] [int] NULL,
[a-null-str] [nvarchar](50) NULL
) ON [PRIMARY]
GO
INSERT [dbo].[test_table] ([a-string], [a-date], [a-int], [a-null-int], [a-null-str]) VALUES (N'asdf', CAST(N'2018-04-11' AS Date), 1, NULL, NULL)
INSERT [dbo].[test_table] ([a-string], [a-date], [a-int], [a-null-int], [a-null-str]) VALUES (N'fdsa', CAST(N'2008-04-11' AS Date), 2, NULL, NULL)
INSERT [dbo].[test_table] ([a-string], [a-date], [a-int], [a-null-int], [a-null-str]) VALUES (N'Bob "Bla" Bob', CAST(N'2028-04-11' AS Date), 3, NULL, NULL)
INSERT [dbo].[test_table] ([a-string], [a-date], [a-int], [a-null-int], [a-null-str]) VALUES (N'Bob, Bob', CAST(N'2038-04-11' AS Date), 4, NULL, NULL)
INSERT [dbo].[test_table] ([a-string], [a-date], [a-int], [a-null-int], [a-null-str]) VALUES (N'Bob bob', CAST(N'1998-04-11' AS Date), 5, 1, NULL)
The last two columns contain some NULL values. The first of them is of type int, the second of type nvarchar.
The code of the stored proc:
CREATE PROCEDURE [dbo].[usp_test]
@query NVARCHAR(max)
AS
BEGIN
EXEC sp_execute_external_script
@language = N'Python',
@script = N'
print(InputDataSet)
',
@input_data_1 = @query
END;
The stored procedure has a single parameter, a query; the result of that query is passed to the Python code as InputDataSet. In the Python code I simply print the data.
How I execute the stored procedure:
EXEC [dbo].[usp_test] N'SELECT [a-string],CAST([a-date] as nvarchar(20)) as [a-date],[a-int],[a-null-int],[a-null-str] FROM [dbo].[test_table]'
The result is:
        a-string      a-date  a-int  a-null-int a-null-str
0           asdf  2018-04-11      1 -2147483648       None
1           fdsa  2008-04-11      2 -2147483648       None
2  Bob "Bla" Bob  2028-04-11      3 -2147483648       None
3       Bob, Bob  2038-04-11      4 -2147483648       None
4        Bob bob  1998-04-11      5           1       None
The unexpected behaviour is in the column a-null-int. How do I get it to be None rather than -2147483648, while staying int?
This question strongly relates to SQL Server. According to this documentation by Microsoft, the BxlServer or the SQL Satellite (I am not sure which) handles the data transfer between SQL Server and Python. I expect the problem to lie in one of those services, but I have no idea how to circumvent it.
Research done:
- Microsoft Tutorial for Python in SQL Server: learn.microsoft.com
- SQL Server 2017 architecture for running Python: https://learn.microsoft.com/en-us/sql/advanced-analytics/python/new-components-in-sql-server-to-support-python-integration?view=sql-server-2017
- Pandas support for integer NA: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
Edit1: Is this question a duplicate of How to store empty value as an Integerfield?
IMO it is not. The problem there seems to have been a discrepancy between data types (str vs int), which is not the case here. If I check the data type, I get:
print(type(InputDataSet.ix[0,"a-null-int"]))
>>> <class 'numpy.int32'>
This is correct: I am passing an int column in, and it gets mapped to an int in Python. But what I need is for it to be None.
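To illustrate the conflict outside SQL Server (a minimal NumPy sketch, not my actual code): a NumPy int32 array has no way to represent a missing value, so neither None nor NaN can be stored in it without changing the dtype:

import numpy as np

# An int32 array cannot hold None ...
try:
    np.array([1, None], dtype=np.int32)
except TypeError as err:
    print("None:", err)

# ... and it cannot hold NaN either
try:
    np.array([1, np.nan], dtype=np.int32)
except ValueError as err:
    print("NaN:", err)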
Edit2: Response to @arun-gurunathan's answer:
Before I begin I need to state that the [a-null-int] column needs to stay integer typed. For context: I need to export the data to a CSV. To illustrate my problem, I changed the value of the last row in the [a-null-int] column from NULL to 1 and changed the beginning of the question accordingly.
With RxMissingValues.int32() I get the value that is used to substitute NULL values, which is -2147483648. I can replace these values with numpy.NaN. This is not a bulletproof fix, because what happens if a column in SQL Server legitimately contains exactly this value? Nevertheless, I continued down this path...
I put the following code into the stored procedure above:
import numpy
from revoscalepy import RxMissingValues

# Replace the value that stands in for NULL with NaN
InputDataSet.loc[InputDataSet["a-null-int"] == RxMissingValues.int32(), ("a-null-int")] = numpy.NaN
print(InputDataSet)
This is what I've got (abbreviated):
   a-null-int
0         NaN
1         NaN
2         NaN
3         NaN
4         1.0
The [a-null-int] column gets transformed to float. This behavior is documented in the pandas docs and has been discussed on Stack Overflow.
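To show the upcast in isolation, here is a minimal sketch in plain pandas/NumPy (outside SQL Server; the file name example_output.csv is just a placeholder). It also shows the to_csv options I would probably use for my CSV export; they only make the column look like integers in the file and do not actually keep the dtype as int:

import numpy as np
import pandas as pd

# The column as it arrives from SQL Server: int32 with the NULL sentinel
s = pd.Series([-2147483648, -2147483648, -2147483648, -2147483648, 1], dtype=np.int32)
print(s.dtype)    # int32

# Swapping the sentinel for NaN forces an upcast, because no NumPy
# integer dtype can represent a missing value
s = s.replace(-2147483648, np.nan)
print(s.dtype)    # float64

# For the CSV export, na_rep and float_format at least write the values
# without a decimal point and NULLs as empty fields
df = pd.DataFrame({"a-null-int": s})
df.to_csv("example_output.csv", index=False, na_rep="", float_format="%.0f")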
I suspect that my problem is not solvable due to NumPy's limited handling of NA values. I'll wait a bit longer to see whether more answers pop up on how to keep the type of the a-null-int column as int, or on some other workaround; otherwise I will accept @arun-gurunathan's answer.