
Problem: Parsing a CSV file client-side with JavaScript.

The first question is: what kind of encoding is this? The following is the output of the command:

cat file.csv | xxd

The dump is truncated: it shows only the header line and the beginning of the second line.

0000000: 4500 2d00 6d00 6100 6900 6c00 2000 6100  E.-.m.a.i.l. .a.
0000010: 6400 7200 6500 7300 3b00 5200 6f00 6500  d.r.e.s.;.R.o.e.
0000020: 7000 6e00 6100 6100 6d00 3b00 4100 6300  p.n.a.a.m.;.A.c.
0000030: 6800 7400 6500 7200 6e00 6100 6100 6d00  h.t.e.r.n.a.a.m.
0000040: 3b00 4300 7200 6500 6400 6900 7400 6500  ;.C.r.e.d.i.t.e.
0000050: 7500 7200 6e00 7500 6d00 6d00 6500 7200  u.r.n.u.m.m.e.r.
0000060: 3b00 4700 6f00 6500 6400 6b00 6500 7500  ;.G.o.e.d.k.e.u.
0000070: 7200 6400 6500 7200 7300 3b00 4600 7500  r.d.e.r.s.;.F.u.
0000080: 6e00 6300 7400 6900 6500 3b00 4b00 6f00  n.c.t.i.e.;.K.o.
0000090: 7300 7400 6500 6e00 7000 6c00 6100 6100  s.t.e.n.p.l.a.a.
00000a0: 7400 7300 3b00 4200 6500 6800 6500 6500  t.s.;.B.e.h.e.e.
00000b0: 7200 6400 6500 7200 3b00 4400 6500 6300  r.d.e.r.;.D.e.c.
00000c0: 6c00 6100 7200 6100 6e00 7400 3b00 4700  l.a.r.a.n.t.;.G.
00000d0: 6f00 6500 6400 6b00 6500 7500 7200 6400  o.e.d.k.e.u.r.d.
00000e0: 6500 7200 3b00 4500 7800 7000 6f00 7200  e.r.;.E.x.p.o.r.
00000f0: 7400 6500 7500 7200 3b00 4700 6500 6100  t.e.u.r.;.G.e.a.
0000100: 6300 7400 6900 7600 6500 6500 7200 6400  c.t.i.v.e.e.r.d.
0000110: 3b00 5000 6500 7200 7300 6f00 6e00 6500  ;.P.e.r.s.o.n.e.
0000120: 6500 6c00 7300 6e00 7500 6d00 6d00 6500  e.l.s.n.u.m.m.e.
0000130: 7200 3b00 4700 6500 6200 7200 7500 6900  r.;.G.e.b.r.u.i.
0000140: 6b00 6500 7200 7300 6e00 6100 6100 6d00  k.e.r.s.n.a.a.m.
0000150: 3b00 5500 3300 2000 2000 2000 2000 2000  ;.U.3. . . . . .
0000160: 2000 2000 2000 2000 2000 2000 2000 2000   . . . . . . . .
0000170: 2000 2000 2000 2000 2000 2000 2000 2000   . . . . . . . .
0000180: 2000 2000 2000 2000 2000 2000 2000 2000   . . . . . . . .
0000190: 2000 2000 2000 2000 2000 2000 2000 2000   . . . . . . . .
00001a0: 2000 2000 2000 2000 2000 2000 2000 2000   . . . . . . . .
00001b0: 2000 2000 2000 2000 2000 2000 2000 2000   . . . . . . . .
00001c0: 2000 2000 2000 2000 2000 2000 2000 2000   . . . . . . . .
00001d0: 2000 2000 2000 2000 2000 2000 2000 2000   . . . . . . . .
00001e0: 2000 2000 2000 2000 2000 2000 2000 2000   . . . . . . . .
00001f0: 2000 2000 2000 2000 2000 2000 2000 0d00   . . . . . . ...
0000200: 0a00 4100 2e00 4a00 4100 4e00 5300 4500  ..A...J.A.N.S.E.

To parse the file we want to be able to loop over each line. To do that we use the following regex:

lines = str.match(/[^\r\n]+/g)

The result looks like this:

['...\u0000', '\u0000', '\u0000...']

But it should actually look like this:

['... ', 'A...']

If the file is not the problem, what regex can I use so that the null bytes do not "break" the match?

Edit:

  • Executing file -I returns application/octet-stream; charset=binary
null
  • I think that this is a misuse of the xxd command. From a quick Google search, xxd creates a hex dump of the file, which defeats the purpose of formatting it as a CSV. Try running it through your parser without using xxd. – etchesketch May 19 '16 at 14:04
  • I just used the command so that you can see the exact content of the file. – null May 19 '16 at 14:06
  • The encoding is probably UTF-16 (Little Endian) or UCS-2. – Casimir et Hippolyte May 19 '16 at 14:15
  • You need to read the file with a UTF-16LE decoder. What's your environment, are we talking about a Node app? BTW, the above regex isn't suitable for splitting lines in a CSV file, as it would match newlines in quoted fields. – bobince May 21 '16 at 17:00
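
The decoding step suggested in the comments could look roughly like the sketch below. It assumes the file really is UTF-16LE without a byte order mark, as the hex dump suggests, and that its bytes are already available as a Uint8Array or ArrayBuffer. The function name decodeUtf16le and the sample bytes are illustrative only.

```javascript
// Sketch: decode UTF-16LE bytes into a normal JavaScript string.
// Assumption: the input is UTF-16LE, as the hex dump suggests.
function decodeUtf16le(bytes) {
  // TextDecoder is a standard Web API (also a global in current Node.js).
  return new TextDecoder('utf-16le').decode(bytes);
}

// Bytes for "U3 \r\nA." laid out as in the dump (low byte first).
const sample = new Uint8Array([
  0x55, 0x00, 0x33, 0x00, 0x20, 0x00, // "U3 "
  0x0d, 0x00, 0x0a, 0x00,             // "\r\n"
  0x41, 0x00, 0x2e, 0x00,             // "A."
]);

const text = decodeUtf16le(sample);

// The regex from the question now sees real characters, not single bytes.
const lines = text.match(/[^\r\n]+/g);
console.log(lines); // two entries, with no "\u0000" characters
```

In a browser, passing the encoding label to FileReader.readAsText (for example reader.readAsText(file, 'utf-16le')) should give the same string without a manual decode step.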

1 Answer


Do you really need to write this mechanism yourself? I found an answer to a similar question that references a jQuery plugin for reading CSV files. (I do not have enough reputation to comment, so I am posting this as an answer.)

Also, since the encoding seems to be 16 bits wide, I don't think a regex is the best approach here. Even if it were only 8 bits wide, dealing with binary formats is usually very difficult unless you know exactly what every byte of the input means.
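
One non-regex option is a small character-by-character parser that runs on the already decoded string. The sketch below assumes semicolon-separated fields (as in the header line of the question), double quotes for quoting, and a doubled quote as an escaped quote. parseCsv is a hypothetical helper written for illustration, not part of any library.

```javascript
// Sketch: parse already-decoded CSV text without a regex.
// Assumptions: ';' separates fields, '"' quotes a field, '""' is an escaped quote.
function parseCsv(text, delimiter = ';') {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        field += ch; // line breaks are kept while inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      row.push(field);
      field = '';
    } else if (ch === '\n' || ch === '\r') {
      if (ch === '\r' && text[i + 1] === '\n') i++; // treat CRLF as one break
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else {
      field += ch;
    }
  }
  // Flush the last record if the text does not end with a line break.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

const rows = parseCsv('Roepnaam;Functie\r\nA.;"line one\r\nline two"\r\n');
console.log(rows); // two records; the quoted field keeps its line break
```

Note that this sketch turns an empty line into a record with a single empty field, so blank lines may need to be filtered out afterwards.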

InDieTasten