Problem: Extra characters in JSON files exported by Azure Data Factory v1.

I have two files which are straight copies of two Cosmos DB collections.
I used Data Factory v1, with the default settings, to copy the collections to a Blob storage container, then used Azure Storage Explorer to copy the JSON files to a Windows 10 desktop.

A. Looking at the files in an editor (vi, Notepad, WordPad, UltraEdit, Visual Studio Code), they look OK.

B. When I attempt to read the files into a simple Node.js (v9) application, I get a JSON parse error:

SyntaxError: C:\Users\ricko\Desktop\whippy\MorpheusDataProduction-01052018-20.json: Unexpected token � in JSON at position 0
at JSON.parse (<anonymous>)
at Object.Module._extensions..json (module.js:654:27)
at Module.load (module.js:554:32)
at tryModuleLoad (module.js:497:12)
at Function.Module._load (module.js:489:3)
at Module.require (module.js:579:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (C:\Users\ricko\Desktop\whippy\appUpdateJSON.js:5:17)
at Module._compile (module.js:635:30)
at Object.Module._extensions..js (module.js:646:10)

C. A single line validates at http://jsonlint.com. Multiple lines do not validate and give a parse error:

    Error: Parse error on line 15:
    ..."_ts": 1512601730} { "path": "Dropbox\
    ----------------------^
    Expecting 'EOF', '}', ',', ']', got '{'

D. Also, when I use Node to read the file directly and then write a record to the console, I get a weird double-spaced version. In one attempt I also saw two �� characters at the beginning of the line, in front of the opening brace. (Example below.)

   { " S T E P _ N A M E " : " O p e n s   T e s t " , " N A M E " : " A r t 
    h u r   J o b e r t " , " D A T E " : " 1 2
   1 9 / 2 0 1 6   3 : 5 7 : 4 7   P M " , " L O T " : " C G 1 5 " , " W A F 
   E R " : " " , " P R O C E S S _ S T E P " : "
   ...
rick o

2 Answers

I would assume it's an encoding issue. The two unprintable characters at the beginning are called a BOM (byte order mark) and denote the encoding. Smart editors handle the BOM transparently, which is why the files look fine there. UltraEdit has a hex mode where you can see the real content byte for byte. Notepad++ can convert the encoding to nearly anything you would use, with or without a BOM.

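C. I would guess there is a comma missing between the closing brace of one object and the opening brace of the next. Several objects in a row are only one valid JSON document if they are comma-separated inside an array.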
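
If the export really does hold one JSON object per line (the question suggests so, since a single line validates, but that is an assumption), the lines can be parsed one at a time instead of patching in commas. A minimal sketch, assuming the text has already been decoded correctly; `parseJsonLines` is a hypothetical helper name, not part of Node:

```javascript
// Hypothetical helper: parse text that holds one JSON object per line.
// Assumes every non-empty line is a complete JSON document on its own.
function parseJsonLines(text) {
  return text
    .split(/\r?\n/)                          // handle Windows and Unix line endings
    .filter((line) => line.trim() !== '')    // skip blank lines
    .map((line) => JSON.parse(line));        // throws on the first invalid line
}
```

This returns an array of objects, so the rest of the application can treat the file as if it had been a JSON array.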

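D. Here the encoding seems to be UTF-16 (often labelled "Unicode" in Windows tools), which uses two bytes per character. Read as a single-byte encoding, every second byte shows up as an extra "space". Verify it with UltraEdit's hex mode.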
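
Without a hex editor, the same check can be sketched in Node by looking at the first bytes of the file. `detectBom` is a hypothetical helper, and it only knows the three most common BOMs (UTF-8, UTF-16 LE, UTF-16 BE), so treat a `null` result as "no BOM seen" rather than proof of any particular encoding:

```javascript
// Hypothetical helper: guess the encoding from a leading byte order mark.
// Covers only the UTF-8, UTF-16 LE and UTF-16 BE BOMs.
function detectBom(buf) {
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) {
    return 'utf8';
  }
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) {
    return 'utf16le';
  }
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) {
    return 'utf16be';
  }
  return null; // no recognised BOM
}

// Usage sketch (the path is a placeholder):
// const fs = require('fs');
// console.log(detectBom(fs.readFileSync('data.json')));
```

Two unprintable characters followed by double-spaced text would be consistent with the `'utf16le'` case.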

I do not know Azure, but most programming languages handle the byte-to-character conversion correctly if you tell them which encoding to use. You always have to be aware of this problem when you serialize text to a byte array (for example, to send it over a socket) and vice versa.
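
In Node that could look roughly like the sketch below: decode the raw bytes with an explicit encoding, drop a leading BOM if one is present, then parse. `parseJsonBuffer` is a hypothetical helper, and `'utf16le'` is only my guess at the encoding of these files:

```javascript
// Hypothetical helper: decode bytes with a stated encoding, strip a BOM, parse.
function parseJsonBuffer(buf, encoding) {
  let text = buf.toString(encoding);        // e.g. 'utf16le' or 'utf8'
  if (text.charCodeAt(0) === 0xfeff) {
    text = text.slice(1);                   // a decoded BOM is the single char U+FEFF
  }
  return JSON.parse(text);
}

// Usage sketch (the path and the encoding are placeholders):
// const fs = require('fs');
// const doc = parseJsonBuffer(fs.readFileSync('data.json'), 'utf16le');
```

Loading the file with `require()` does not let you choose the encoding, which is why the sketch reads the bytes itself.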

Heri

The error is due to the Unicode BOM (byte order mark), the hidden characters at the beginning of the file.

The answer can be found here: node.js readfile error with utf8 encoded file on windows

rick o