2

Okay, so I've been bashing my head against the table over this one.

I am importing an XML file that was exported by InDesign. My application parses it and creates a file based on the input. (I'm building a JS application with Node.)

This file looks good in my PhpStorm IDE. But when I open it in gedit, I see some unwanted newlines here and there.

I've managed to track it down to this character: ->
<- (it really is there: copy it somewhere and move your cursor over it using the arrow keys. It's stuck in the middle).

Viewed in a hex editor, this character turns out to be 0x80 0xE2 0xA9.

When I tried to replace it using a simple JavaScript replace:

data = data.replace('
', ''); //There IS a character in the left one. Trust me.

I got the following parse error:

[image: screenshot of the parse error]

In vim, the following shows up at that place: ~@�

How can I remove it from my output? Escaping the character in the JS code made it compile just fine, but the weird character was still there. I'm out of ideas.

Álvaro González
Rob

2 Answers

3

You need to use '\u2029' as the search string. The sequence you are trying to replace is a "paragraph separator" Unicode character inserted by InDesign.

So:

string.replace('\u2029', '');

instead of the character itself.
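Since a plain string pattern only replaces the first match, here is a minimal sketch of removing every occurrence with an escaped regular expression. The helper name `stripUnicodeSeparators` is made up for illustration, and including U+2028 (LINE SEPARATOR) is an assumption based on the comments in this thread, which report that it showed up in the same export.

```javascript
// Hypothetical helper: strips the Unicode line and paragraph separators
// (U+2028 and U+2029) from a string.
function stripUnicodeSeparators(text) {
  // The g flag makes replace() act on every occurrence, not just the first.
  // The \u escapes keep the raw characters out of the source file.
  return text.replace(/[\u2028\u2029]/g, '');
}

const sample = 'first\u2029second\u2028third\u2029';
console.log(JSON.stringify(stripUnicodeSeparators(sample))); // prints "firstsecondthird"
```

If the separators carry meaning in your data, you may prefer to replace them with `'\n'` rather than an empty string.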

goran
  • Cool, thanks! I'll check this out tomorrow. What would be a quick method to find the unicode version of a UTF-8 character? Because I couldn't find it that easily. – Rob Nov 25 '15 at 19:00
  • @RobQuist I've linked one in the edit to my answer. Happy hunt! – Álvaro González Nov 25 '15 at 19:58
  • 1
    That was it :) string.replace(/[\u2029]/g, ''); done fixed it. Thanks a lot goran! – Rob Nov 26 '15 at 09:10
  • Seems like U+2028 was also in the "XML" – Rob Nov 26 '15 at 09:34
  • 1
    You need to use `\u2029` instead of the actual char for the same reason that you need to use `\n` instead of a regular line feed: unlike other languages, the JavaScript syntax doesn't allow line feeds inside string literals. – Álvaro González Nov 26 '15 at 09:49
  • 2
    @RobQuist: the easiest way is to look up the literal character in an online service. This is the result of that "empty" character you provided: http://www.fileformat.info/info/unicode/char/search.htm?q=%E2%80%A9&preview=entity – goran Nov 27 '15 at 20:31
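As an alternative to an online lookup, the code point of a pasted character can probably be found straight from Node or a browser console. This is only a sketch; the helper name `toCodePoints` is hypothetical, while `codePointAt` and `Array.from` are standard JavaScript.

```javascript
// Hypothetical helper: lists each character of a string as a U+XXXX code point.
function toCodePoints(text) {
  // Array.from iterates by code point, so characters outside the BMP stay whole.
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

console.log(toCodePoints('a\u2029b')); // [ 'U+0061', 'U+2029', 'U+0062' ]
```

Pasting the suspicious text into the call in place of the escaped sample should reveal which invisible characters it contains.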
3

String.replace() doesn't work exactly the way you think. The way you use it, it'll only replace the first occurrence:

> "abc abc abc".replace("a", "x");
'xbc abc abc'

You need to add the g (global) flag, and the only standard way to do that is to use a regular expression as the pattern:

> "abc abc abc".replace(/a/g, "x");
'xbc xbc xbc'

You can have a look at "Fastest method to replace all instances of a character in a string" for further ideas.


A search for 0x80 0xE2 0xA9 as UTF-8 shows that no such character exists, but it's probably a mistype for 0xE2 0x80 0xA9, which corresponds to 'PARAGRAPH SEPARATOR' (U+2029), as goran points out in his answer. You don't normally need to encode exotic characters as JavaScript \u#### escapes as long as your whole toolset is properly configured to use UTF-8. In this case, however, the JavaScript engine treats the character as a line terminator and raises a syntax error, because line terminators aren't allowed inside JavaScript string literals.
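The byte-order point can be checked with a strict UTF-8 decode. This is a small sketch that assumes a runtime with a global `TextDecoder` (recent Node versions and browsers); the helper name `decodeUtf8OrNull` is made up for illustration.

```javascript
// Hypothetical helper: decodes bytes as strict UTF-8, or returns null
// when the byte sequence is not valid UTF-8.
function decodeUtf8OrNull(bytes) {
  try {
    // fatal: true makes decode() throw on malformed input instead of
    // silently substituting U+FFFD.
    return new TextDecoder('utf-8', { fatal: true }).decode(Uint8Array.from(bytes));
  } catch (err) {
    return null;
  }
}

const good = decodeUtf8OrNull([0xE2, 0x80, 0xA9]);
console.log(good === '\u2029'); // true
console.log(decodeUtf8OrNull([0x80, 0xE2, 0xA9])); // null
```

The second call fails because 0x80 is a continuation byte and cannot start a UTF-8 sequence, which is consistent with the search finding no character for the bytes in that order.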

Álvaro González
  • Thanks for the heads-up :) I know about the replace function. I actually never even use strings but straight-up regexes, but that caused trouble, hence the string method. Good to mention though! – Rob Nov 25 '15 at 19:00
  • I think goran has a point, I've edited my answer to add further info. – Álvaro González Nov 25 '15 at 19:56
  • So yeah, I wrote it down in the wrong order, that's why I couldn't find anything. Turns out U+2028 was also in the XML. – Rob Nov 26 '15 at 09:35