Machine learning question: SQL Server 2017 Python Service
Error: MemoryError - File "cyproject\util.pyx", line 10, in cyproject.util.json_to_list (raised whenever I SELECT more than about 7,000 records)
I have a large table, [dbo].[sample_data], in one of my SQL Server databases. Its schema is:
CREATE TABLE [dbo].[sample_data](
    [index] [bigint] IDENTITY(0,1) NOT NULL PRIMARY KEY,
    [binarydata] [varbinary](max) NOT NULL,
    [dataset] AS CAST(DECOMPRESS([binarydata]) AS nvarchar(max)),
    [datacount] [int] NOT NULL
)
Each [binarydata] value holds a compressed JSON string, e.g. COMPRESS(N'[1,2,3,4,5,6]').
Sample data:
INSERT INTO [dbo].[sample_data] ([binarydata], [datacount])
VALUES (COMPRESS(N'[1,2,3,4,5,6]'), 6)
Note: [dbo].[sample_data].[dataset] is a computed column; it returns the decompressed form of [dbo].[sample_data].[binarydata] (i.e., N'[1,2,3,4,5,6]').
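For reference, SQL Server's COMPRESS/DECOMPRESS use the GZIP algorithm, and an N'...' literal is UTF-16LE encoded, so a rough Python equivalent of that round trip is:
import gzip

# Rough Python equivalent of COMPRESS(N'[1,2,3,4,5,6]'):
# COMPRESS applies GZIP, and an N'...' literal is UTF-16LE encoded.
compressed = gzip.compress('[1,2,3,4,5,6]'.encode('utf-16-le'))

# DECOMPRESS reverses it; decoding the bytes recovers the JSON string.
assert gzip.decompress(compressed).decode('utf-16-le') == '[1,2,3,4,5,6]'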
I wrote a Cython function to convert the JSON strings (the [dataset] values) into a list of NumPy arrays.
Cython function (cyproject/util.pyx):
import json
import numpy as np

def json_to_list(list json_array):
    cdef:
        list result = []
        str item
    # Parse each JSON string and wrap the parsed values in a NumPy array
    for item in json_array:
        result.append(np.array(json.loads(item)))
    return result
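For testing outside SQL Server (without building the extension), a plain-Python equivalent behaves the same on small inputs:
import json
import numpy as np

def json_to_list_py(json_array):
    # Mirrors util.pyx: parse each JSON string into a NumPy array
    return [np.array(json.loads(item)) for item in json_array]

print(json_to_list_py(['[1,2,3,4,5,6]', '[7,8]']))
# [array([1, 2, 3, 4, 5, 6]), array([7, 8])]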
I call this Cython function from the SQL Server 2017 Python service:
SQL Server script (Python service):
DECLARE @pscript NVARCHAR(MAX);
SET @pscript = N'
# assign SQL Server dataset to df
df = InputDataSet
import cyproject.util
json_array = df.dataset.values
result = cyproject.util.json_to_list(json_array.tolist())
';
DECLARE @sqlscript NVARCHAR(MAX);
SET @sqlscript = N'
SELECT TOP (7500)
[dataset]
FROM [dbo].[sample_data]';
EXEC sp_execute_external_script
@language = N'Python',
@script = @pscript,
@input_data_1 = @sqlscript
With TOP (7500), or anything above roughly 7,000 records, this fails with:
Error: MemoryError - File "cyproject\util.pyx", line 10, in cyproject.util.json_to_list
The table currently holds about 74K records and will keep growing. It stores only compressed data: the table itself is 1.822 GB, while the uncompressed data is over 9 GB.
On execution, SQL Server materializes the entire result set into the pandas DataFrame (InputDataSet); I take just the [dataset] column values, pass them to the Cython function, and get back a list of NumPy arrays. The DataFrame holding all the JSON text and the list of parsed arrays are alive at the same time, so peak memory is well above the raw data size. I can't process more than about 7,000 records without hitting the MemoryError. How can I process the full table?
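To make the duplication concrete, here is a minimal local simulation of the same steps (the row count and payload are made up, and json_to_list is replaced by the plain-Python equivalent above):
import json
import numpy as np
import pandas as pd

rows = 7000
payload = json.dumps(list(range(1000)))           # stand-in for one [dataset] value
df = pd.DataFrame({'dataset': [payload] * rows})  # stand-in for InputDataSet

json_array = df.dataset.values                    # object array of the same strings
as_list = json_array.tolist()                     # list referencing the same strings
result = [np.array(json.loads(s)) for s in as_list]  # one parsed array per row

# df (all JSON text) and result (all parsed arrays) coexist until the
# script ends, so peak memory is roughly text plus parsed data combined.
print(sum(a.nbytes for a in result) / 1e6, 'MB of parsed arrays alone')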