3

In Azure Synapse Serverless SQL Pool, while reading JSON documents using OPENROWSET via the CSV parser, we are having to specify the FIELDTERMINATOR and FIELDQUOTE as '0x0b' (vertical tab). Can someone help understand the reason why please?

If it's a single line JSON (JSONL), each line in the file is a JSON document and I would expect the entire line to be read as a single value and not requiring to specify the FIELDTERMINATOR or FIELDQUOTE.

Are we just overriding the default values of the field terminator being comma (,) and the field quite being double quote (")? If so, what is the significance of '0x0b'?

Documentation is here - https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files

Example code from the documentation as below

select top 10 *
from openrowset(
        bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
        format = 'csv',
        fieldterminator ='0x0b',
        fieldquote = '0x0b'
    ) with (doc nvarchar(max)) as rows
Ramesh
  • 1,405
  • 10
  • 19

1 Answers1

3

Yes, we are overriding csv default parameters to be compatible with json. These parameters make Synapse read the lines in text file and treat it properly for json/jsonl content. Here is some background why:

OPENROWSET is created for reading files in multiple file formats. Synapse serverless currently supports csv, parquet and delta (delta lake). OPENROWSET does not support json as a separate format. As json and csv are both similar text files, csv-format with suitable parameters is used for reading json. Actually OPENROWSET itself does not treat the file as json or jsonl in this case. It treats the file as a csv/text file with one column and multiple lines. The next steps, e.g. JSON_VALUE or OPENJSON, makes Synapse treat the text as json.

When you specify format='csv' the defaults for fieldterminator and fieldquote are comma and double quote, exactly like you said. These values would be problematic when reading json because your json can contain these characters for other purposes. This is why when reading json files you need to make openrowset ignore fieldterminators and fieldquotes by setting them to some value your json does not contain. Unfortunatelly csv format does not have other ways to switch them off.

Character 0x0b is a vertical tabulator character that nowadays is used very rarely. Because of this it is a common practice to use this character to make openrowset ignore these parameters. You can also use other characters as long as they are not used in your json. More on VT history can be found here: What is a vertical tab?

You can try to include 0x0b character in your json and see that it will raise Synapse error 'Bulk load failed due to invalid column value in CSV data file'.

Veikko
  • 3,372
  • 2
  • 21
  • 31