unexpected characters in JSON file

Question

Problem: Extra characters in JSON files from Azure Data Factory v1.

I have two files which are straight copies of two Cosmos DB collections.
I used Data Factory v1, selecting the defaults, to copy the collections to a Blob storage container, then used Azure Storage Explorer to copy the JSON files to a Windows 10 desktop.

A. Looking at the files in an editor (vi, Notepad, WordPad, UltraEdit, Visual Studio Code), they look OK.

B. When I attempt to read the files into a simple Node.js (v9) application, I get a JSON parse error:

SyntaxError: C:\Users\ricko\Desktop\whippy\MorpheusDataProduction-01052018-20.json: Unexpected token � in JSON at position 0
at JSON.parse (<anonymous>)
at Object.Module._extensions..json (module.js:654:27)
at Module.load (module.js:554:32)
at tryModuleLoad (module.js:497:12)
at Function.Module._load (module.js:489:3)
at Module.require (module.js:579:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (C:\Users\ricko\Desktop\whippy\appUpdateJSON.js:5:17)
at Module._compile (module.js:635:30)
at Object.Module._extensions..js (module.js:646:10)

C. A single line validates at http://jsonlint.com. Multiple lines do not validate and give a parse error:

    Error: Parse error on line 15:
    ..."_ts": 1512601730} { "path": "Dropbox\
    ----------------------^
    Expecting 'EOF', '}', ',', ']', got '{'

D. Also, when I use Node to read the file directly and write a record to the console, I get a weird double-spaced version. In one attempt I also saw two �� characters at the beginning of the line, in front of the opening brace (example below).

   { " S T E P _ N A M E " : " O p e n s   T e s t " , " N A M E " : " A r t 
    h u r   J o b e r t " , " D A T E " : " 1 2
   1 9 / 2 0 1 6   3 : 5 7 : 4 7   P M " , " L O T " : " C G 1 5 " , " W A F 
   E R " : " " , " P R O C E S S _ S T E P " : "
   ...

well, this is probably due to lf\crlf issue (or encoding issue). vscode is smart enough to display the file, but nodejs probably isnt — 4c74356b41, Jan 12 '18 at 19:50
The file is probably Unicode and your code is expecting ASCII. — Duston, Jan 12 '18 at 20:06

score 0 · Answer 1 · answered Jan 12 '18 at 20:11

I would assume it's an encoding issue. The two unprintable characters at the beginning are a byte order mark (BOM), which indicates the encoding. Smart editors handle this transparently. UltraEdit has a hex mode where you can see the real content byte for byte. Notepad++ can convert the file to nearly any encoding you would use, with or without a BOM.
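
If no hex editor is at hand, the first bytes can also be checked from Node itself. This is a minimal sketch, not a complete encoding detector: `detectBom` is a hypothetical helper name, and it only knows the UTF-8 and UTF-16 marks (a UTF-32 LE file would be reported as `utf16le`).

```javascript
// Report which Unicode byte order mark, if any, starts a buffer.
// detectBom is a hypothetical helper, not part of Node's API.
function detectBom(buf) {
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) return 'utf8';
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) return 'utf16le';
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) return 'utf16be';
  return null; // no BOM found; the encoding is still unknown
}

// Bytes of a UTF-16 LE file containing "{}" preceded by a BOM.
const head = Buffer.from([0xff, 0xfe, 0x7b, 0x00, 0x7d, 0x00]);
console.log(detectBom(head)); // utf16le
```

For a real file, the buffer would come from `fs.readFileSync(path)` called without an encoding argument, so that the raw bytes are returned.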

C. I would guess there is a comma missing between the closing and the opening brace.

D. Here the encoding seems to be Unicode with two bytes per character (most likely UTF-16). Verify it with UltraEdit's hex mode.
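
On point C, another possible reading is that the file holds one complete JSON object per line, which would fit the observation that a single line validates. Under that assumption (it is not confirmed for the Data Factory output), each line can be parsed separately rather than handing the whole text to `JSON.parse`. `parseJsonLines` is a hypothetical helper name, and the text is assumed to be decoded correctly already.

```javascript
// Parse text that holds one JSON document per line (an assumption about
// the file layout) instead of parsing the whole text in one call.
function parseJsonLines(text) {
  return text
    .split(/\r?\n/)                      // accept LF and CRLF line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0)   // ignore blank lines
    .map((line, i) => {
      try {
        return JSON.parse(line);
      } catch (err) {
        // Point at the failing record instead of a bare position.
        throw new SyntaxError(`record ${i + 1}: ${err.message}`);
      }
    });
}

const sample = '{"_ts": 1512601730}\r\n{"path": "Dropbox"}\r\n';
console.log(parseJsonLines(sample).length); // 2
```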

I do not know Azure, but most programming languages handle the byte-to-character conversion correctly if you specify the encoding you need. You always have to be aware of this problem when you serialize text to a byte array (for example, to send it over a socket) and vice versa.
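
A small Node illustration of that point, assuming the data is UTF-16 LE as suspected in D. The sample string is made up. Decoding the same bytes with a single-byte encoding leaves a zero character after every letter, which a console shows as the "double spaced" text, while naming the right encoding gives back parseable JSON.

```javascript
// The same bytes decoded two ways. UTF-16 LE stores a zero byte after
// every ASCII character; a single-byte decoding keeps those zeros.
const bytes = Buffer.from('{"NAME":"Arthur"}', 'utf16le');

const wrong = bytes.toString('latin1');   // one character per byte
const right = bytes.toString('utf16le');  // two bytes per character

console.log(wrong.length, right.length);  // 34 17
console.log(JSON.parse(right).NAME);      // Arthur
```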

Do you assume I can guess **what** exactly you have tried and **what** exactly "did not work"? — Heri, Jan 13 '18 at 11:38

score 0 · Accepted Answer · answered Jan 16 '18 at 19:49

The error is due to the Unicode BOM (byte order mark), the hidden characters at the beginning of the file.

Answer can be found here: node.js readfile error with utf8 encoded file on windows
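
A minimal sketch of that kind of fix, assuming the text has already been decoded with the file's real encoding so that the mark arrives as the single character U+FEFF. `parseJsonWithBom` is a hypothetical helper name.

```javascript
// Remove a leading byte order mark (U+FEFF) from already-decoded text,
// then parse it. JSON.parse does not accept the BOM as whitespace.
function parseJsonWithBom(text) {
  const clean = text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
  return JSON.parse(clean);
}

const withBom = '\ufeff{"LOT": "CG15"}';
console.log(parseJsonWithBom(withBom).LOT); // CG15
```

The text would typically come from `fs.readFileSync(path, 'utf8')`, or from `fs.readFileSync(path, 'utf16le')` if the file turns out to be UTF-16. If the wrong encoding is passed, the BOM bytes do not decode to U+FEFF and this check will not catch them.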

rick o
