
I'm creating an external table using a CSV stored in Azure Data Lake Storage and populating the table using PolyBase in SQL Server.

However, I ran into the error below. I suspect it's because one particular column contains double quotes within its string values, while the string delimiter has been specified as " in PolyBase (STRING_DELIMITER = '"').

HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Could not find a delimiter after string delimiter
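For context, a file format along these lines reproduces the setup described (the format name and the FIRST_ROW/ENCODING options are placeholder assumptions, not taken from the question):

CREATE EXTERNAL FILE FORMAT ff_csvWithQuotes
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',  -- the option that trips over embedded quotes
        FIRST_ROW = 2,
        ENCODING = 'UTF8'
        )
);
GO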

Example:

Hello "world", it's me

I have done quite extensive research on this and found that the issue has been around for years, but I have yet to see any solution.

Any help will be appreciated.

jcoke
  • Can you provide some sample data please? This is probably the issue. – wBob Jan 27 '21 at 17:10
  • I've added a string similar to what's present in one of my columns, and I think the issue is that PolyBase does not allow something like this – jcoke Jan 27 '21 at 17:16
  • The way I would approach this is to *not* specify the string delimiter, import that data and clean it up using SQL. The other way to think about it is, can you write a rule in English that you want the import to follow? – wBob Jan 27 '21 at 17:30
  • Assuming your data has other columns, can you please provide a more realistic sample? – wBob Jan 27 '21 at 18:52
  • If your example is correct then you are not dealing with properly encoded CSV files. [RFC 4180 Common Format and MIME Type for Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180) would have you encode that string as: `"Hello ""world"", it's me"` (note the doubled double-quotes around world). – AlwaysLearning Jan 27 '21 at 21:29
  • It's a JSON file which I'm getting from MongoDB and converting to a CSV in Python using pandas – jcoke Jan 28 '21 at 08:48
  • @wBob I understand I can clean it up in SQL, but it needs to be done within Python as part of the ETL process, i.e. cleaned up before it enters the data warehouse. Also, I have added an example of what's in the CSV file – jcoke Jan 28 '21 at 08:50
  • Hi, I've provided an answer. Not sure if it's quite what you are after but have a look. Basically, if you set the delimiter to pipe and leave off the string delimiter, you can import the file without error and fix it up using a CTAS later. – wBob Jan 28 '21 at 21:38

2 Answers


Because you are in charge of creating the .csv, I think the easiest way to fix this is to use a field delimiter which is not a comma and to leave off the string delimiter. Use a separator which you know will not appear in the file. I've used a pipe in my example, and I clean up the string once it has been imported into the database.

A simple example:

IF EXISTS ( SELECT * FROM sys.external_tables WHERE name = 'delimiterWorking' )
DROP EXTERNAL TABLE delimiterWorking
GO

IF EXISTS ( SELECT * FROM sys.tables WHERE name = 'cleanedData' )
DROP TABLE cleanedData
GO



IF EXISTS ( SELECT * FROM sys.external_file_formats WHERE name = 'ff_delimiterWorking' )
DROP EXTERNAL FILE FORMAT ff_delimiterWorking
GO

CREATE EXTERNAL FILE FORMAT ff_delimiterWorking
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        --STRING_DELIMITER = '"',
        FIRST_ROW = 2,
        ENCODING = 'UTF8'
        )
);
GO


CREATE EXTERNAL TABLE delimiterWorking (
    id                  INT NOT NULL,
    body                VARCHAR(8000) NULL
)
WITH (
    LOCATION = 'yourLake/someFolder/delimiterTest6.txt',
    DATA_SOURCE = ds_azureDataLakeStore,
    FILE_FORMAT = ff_delimiterWorking,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
GO


SELECT *
FROM delimiterWorking
GO



-- Fix up the data
CREATE TABLE cleanedData
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = ROUND_ROBIN
    )
AS
SELECT
    id,
    body AS originalCol,
    SUBSTRING ( body, 2, LEN(body) - 2 ) cleanBody -- strip the leading and trailing double quotes
FROM delimiterWorking
GO



SELECT *
FROM cleanedData

My results:

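If the export also doubles any embedded quotes (RFC 4180 style, as raised in the comments), a further pass could collapse those too. A hypothetical extra step over the cleanedData table above:

SELECT
    id,
    REPLACE ( cleanBody, '""', '"' ) AS unescapedBody -- collapse doubled double-quotes
FROM cleanedData
GO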

wBob

The string delimiter issue can be avoided if you convert the data lake flat file to Parquet format.

Input:

"ID" "NAME" "COMMENTS"
"1" "DAVE" "Hi "I am Dave" from"
"2" "AARO" "AARO"

Steps:

1. Convert the flat file to Parquet format [using Azure Data Factory]

2. Create an external file format in the data lake [assuming a master key and scoped credentials are available]

CREATE EXTERNAL FILE FORMAT PARQUET_CONV
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

3. Create the external table with FILE_FORMAT = PARQUET_CONV (a sketch follows below)
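A minimal sketch of that last step (the table name, location and data source are placeholder assumptions; the columns follow the sample input above):

CREATE EXTERNAL TABLE parquetConv (
    ID          INT NOT NULL,
    NAME        VARCHAR(100) NULL,
    COMMENTS    VARCHAR(8000) NULL
)
WITH (
    LOCATION = 'yourLake/someFolder/convertedFile.parquet',
    DATA_SOURCE = ds_azureDataLakeStore,
    FILE_FORMAT = PARQUET_CONV
);
GO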

Output:

[screenshot of the query output]

I believe this is the best option, as Microsoft currently doesn't have a solution for handling string delimiters occurring within the data for external tables.

Venk AV
  • Does that not remove one of the quotes which should be before the letter `I`? – jcoke Feb 02 '21 at 08:39
  • @user3396351, correct, it's still a bug then. We could introduce an escape character for all occurrences of quotes except the string delimiter, regardless of whether it's flat or Parquet format – Venk AV Feb 02 '21 at 17:55