
Machine learning related question: SQL Server 2017 Python Service

Error: MemoryError - File "cyproject\util.pyx", line 10, in cyproject.util.json_to_list (I get this error whenever I SELECT more than 7,000 records)

I have a large record set in one of my SQL Server database tables, named "sample_data".


The schema of the table is:

CREATE TABLE [dbo].[sample_data](
        [index] [bigint] IDENTITY(0,1) NOT NULL PRIMARY KEY,
        [binarydata] [varbinary](max) NOT NULL,
        -- nvarchar(max): the stored JSON is a Unicode (N'') string
        [dataset] AS CAST(DECOMPRESS([binarydata]) AS nvarchar(max)),
        [datacount] [int] NOT NULL
)

The binary data holds the compressed value of a JSON string, e.g. COMPRESS(N'[1,2,3,4,5,6]').

Sample Data

INSERT INTO [dbo].[sample_data]
           ([binarydata]
           ,[datacount])
     VALUES
           (COMPRESS(N'[1,2,3,4,5,6]')
           , 6)

Note: [dbo].[sample_data].[dataset] is a computed column; it returns the DECOMPRESSed form of [dbo].[sample_data].[binarydata] (i.e., N'[1,2,3,4,5,6]').
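
For reference, the round trip of this computed column can be reproduced in plain Python, since SQL Server's COMPRESS/DECOMPRESS use the GZip format and N'' strings are UTF-16LE. This is only an illustrative sketch; the variable names below are not part of my actual code:

import gzip
import json

# roughly equivalent to COMPRESS(N'[1,2,3,4,5,6]')
binarydata = gzip.compress('[1,2,3,4,5,6]'.encode('utf-16-le'))

# roughly equivalent to CAST(DECOMPRESS([binarydata]) AS nvarchar(max))
dataset = gzip.decompress(binarydata).decode('utf-16-le')

print(json.loads(dataset))  # [1, 2, 3, 4, 5, 6]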

I wrote a Cython method to convert the JSON strings (i.e., [dbo].[sample_data].[dataset]) into lists of values.

Cython method (cyproject\util.pyx):

import json
import numpy as np

def json_to_list(list json_array):
    cdef:
        list result = []
        str item

    # parse each JSON string and store it as a NumPy array
    for item in json_array:
        result.append(np.array(json.loads(item)))

    return result
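
A lower-memory variant of this method would also be possible (a sketch, not what I currently run): parse into the smallest suitable NumPy dtype and consume the input list as it goes, so the raw strings and the parsed arrays never coexist in full. It assumes every value fits in np.int32:

import json
import numpy as np

def json_to_list_lowmem(list json_array):
    cdef list result = []

    # pop from the end (O(1)) so each raw string can be freed
    # as soon as its parsed array has been stored
    while json_array:
        result.append(np.array(json.loads(json_array.pop()),
                               dtype=np.int32))

    result.reverse()  # restore the original row order
    return result

If the values are small enough, an even smaller dtype such as np.int16 would cut the footprint further.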

I call this Cython method from the SQL Server 2017 Python service:

Sample code: SQL Server script (Python service)

DECLARE @pscript NVARCHAR(MAX);
SET @pscript = N'
# assign SQL Server dataset to df
df = InputDataSet

import cyproject.util

json_array = df.dataset.values

result = cyproject.util.json_to_list(json_array.tolist())
';

DECLARE @sqlscript NVARCHAR(MAX);
SET @sqlscript = N'
  SELECT TOP (7500) 
     [dataset]
  FROM [dbo].[sample_data]';

EXEC sp_execute_external_script
  @language = N'Python',
  @script = @pscript,
  @input_data_1 = @sqlscript

Running this with TOP (7500) fails with the MemoryError shown above; with 7,000 rows or fewer it completes.

The table currently holds roughly 74K records, and it may grow. It stores only compressed data (the table size is 1.822 GB); the uncompressed data size is over 9 GB.

On execution, the entire result set is loaded into a pandas DataFrame, from which I take only the dataset column values. I then pass that list of values to the Cython method, which returns a list of NumPy arrays.
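
One variation I could try (a sketch only; BATCH is a hypothetical tuning knob, and df comes from InputDataSet as in the script above) is to convert the column in fixed-size batches inside @pscript, so that each batch of raw strings is released once its arrays are built:

import cyproject.util

BATCH = 1000  # hypothetical batch size; tune to available memory

values = df.dataset.values.tolist()
result = []
while values:
    # slice off one batch and drop it from the pending list
    batch, values = values[:BATCH], values[BATCH:]
    result.extend(cyproject.util.json_to_list(batch))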

I can't process more than 7K records without hitting the MemoryError. Kindly assist me with how to process the full table.

Sreedharan
  • The question has nothing to do with `machine-learning` - kindly do not spam the tag (removed). – desertnaut Jun 13 '18 at 10:02
  • @desertnaut - No issues. Thanks. – Sreedharan Jun 13 '18 at 11:12
  • You could try making sure that the numpy `dtype` is the smallest suitable data type (e.g. `np.int16` or `np.int32`). You could also try to use `list.pop` on `json_data` to free some memory as you go (I don't know how efficient it'll be to go from front to back with `pop` though) – DavidW Jun 13 '18 at 18:54
  • In case it helps someone in future, it might be possible to address the memory error by increasing the memory available to Python by altering the external resource pool. See my answer here: https://stackoverflow.com/a/69082468/1714 – Hobo Sep 07 '21 at 04:39

0 Answers