I am trying to find a simple JS way to convert RTF to plain text, and I found a simple solution that seems satisfactory for my needs. However, all my RTF is in Portuguese, with some Latin1 characters that the functions in that solution do not replace.

I just added one more regexp to replace RTF's "\'hh" sequences with JavaScript's "\xhh", so I have:

function convertToPlain(rtf) {
    rtf = rtf.replace(/\\par[d]?/g, "")

    rtf = rtf.replace(/\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?/g, "").trim()

    rtf = rtf.replace(/\\'/g, '\\x')

    return rtf;
}

The replacements happen. But, playing with the code in JSFiddle, I can't get the returned string to have its "\xhh" sequences substituted. Here's a sample of the result:

 a inaugura\xe7\xe3o do novo Castel\xe3o, para as competi\xe7\xf5es

However, if I change the return statement to use the above sample as a literal, like...

return " a inaugura\xe7\xe3o do novo Castel\xe3o, para as competi\xe7\xf5es"

... the characters are substituted as expected:

 a inauguração do novo Castelão, para as competições

It seems that something happens with a string variable (but not with a string literal) that prevents its escape sequences from being substituted. However, I could not find any explanation for this here on SO, nor in MSDN, the W3C documents, or the books I have.

Could somebody please shed some light on this? Thanks!

Fabricio

  • Unfortunately you can't automatically convert escape codes within any string just like that unless they're presented to the scripting engine within a string literal, because that's when escape codes are processed and at no other time. You likely need to parse the resulting string for the escape codes and replace them with the correct latin1 character manually. `String.replace(RegExp,function)` would be my first go-to to do this. – Xeren Narcy Jan 11 '17 at 22:59

1 Answer

You are getting back a string that contains escape sequences, and you need to unescape them yourself; it's as simple as that, I imagine. There's no magic in strings to automatically unescape escaped character sequences, and rightfully so (otherwise how could you store them?).
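
To make the distinction concrete, here is a minimal sketch (the variable names are illustrative, not from the question). Escape sequences are resolved by the parser when it reads a string literal in source code, so a string assembled at runtime keeps its backslash as an ordinary character:

```javascript
// In source code, the parser turns the escape into a single character.
const literal = "\xe7";
// Built at runtime, a backslash followed by "xe7" is just four characters.
const built = "\\" + "xe7";

console.log(literal.length);    // 1
console.log(built.length);      // 4
console.log(literal === built); // false
```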

I think you are looking for this:

How do I decode a string with escaped unicode?

The common method suggested there is unescape(JSON.parse(...)) (see the examples via the link); otherwise you have to scan the string and convert the sequences yourself, as the accepted answer on that page does.
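
One caveat with the JSON.parse route: JSON only defines \uhhhh escapes, not \xhh, so JSON.parse would throw on your string as it stands. A minimal sketch of a workaround, assuming the text contains no double quotes, raw newlines, or other backslashes (the helper name decodeViaJson is hypothetical):

```javascript
// Hypothetical helper: rewrite each \xhh as \u00hh so JSON.parse can decode it.
// Assumes s has no double quotes, raw newlines, or other backslashes.
function decodeViaJson(s) {
    const jsonSafe = s.replace(/\\x([0-9a-fA-F]{2})/g, "\\u00$1");
    return JSON.parse('"' + jsonSafe + '"');
}

console.log(decodeViaJson("Castel\\xe3o")); // Castelão
```

If those assumptions do not hold for your RTF, scanning with a regex is likely the safer choice.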

There is another way, using eval('"'+s+'"'), but never do that on text you receive from the server side. It can be OK only if you are 100% sure it is safe to do so (even Doug Crockford uses it in his JSON parser).

Here is the code from the accepted answer, edited for your case:

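// Sample input: the backslashes are doubled so the string holds literal \xhh text
var x = "a inaugura\\xe7\\xe3o do novo Castel\\xe3o, para as competi\\xe7\\xf5es";
// Group 1 captures the 4 hex digits of \uhhhh, group 2 the 2 hex digits of \xhh
var r = /\\u([\d\w]{4})|\\x([\d\w]{2})/gi;
x = x.replace(r, function (match, grp, grp2) {
    // Only one of the two groups is set for any given match
    return String.fromCharCode(parseInt(grp || grp2, 16));
});
x = unescape(x); // carried over from the linked answer; no %xx sequences here, so a no-op
console.log(x);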

Result:

a inauguração do novo Castelão, para as competições

Note: The main change was to the regex: I added the alternative |\\x([\d\w]{2}), which is the \u pattern with {4} changed to {2}, to support \x. This is needed because you are using one-byte hex escapes (\xhh, for characters under 256) instead of the two-byte Unicode \uhhhh form.
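
The intermediate \xhh step can also be skipped by decoding RTF's \'hh sequences directly inside the question's function. The following is only a sketch under two assumptions: the last replace is not part of the original code, and each \'hh byte is treated as a Latin-1 code point (which matches Windows-1252 for accented letters such as ç and ã, but not for bytes 0x80 to 0x9F). The sample RTF string is made up for illustration.

```javascript
function convertToPlain(rtf) {
    // The first two replacements are the ones from the question.
    rtf = rtf.replace(/\\par[d]?/g, "");
    rtf = rtf.replace(/\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?/g, "").trim();
    // Assumption: decode each \'hh as the Latin-1 character with that code.
    return rtf.replace(/\\'([0-9a-fA-F]{2})/g, function (match, hex) {
        return String.fromCharCode(parseInt(hex, 16));
    });
}

// Made-up sample; backslashes are doubled because this is a string literal.
const sample = "{\\rtf1\\ansi{\\fonttbl\\f0 Arial;}\\f0 a inaugura\\'e7\\'e3o do novo Castel\\'e3o}";
console.log(convertToPlain(sample)); // a inauguração do novo Castelão
```

Decoding happens after the control words are stripped so that a decoded brace or backslash is not mistaken for RTF markup.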

James Wilkins
  • Thank you, James. It did not work as is, until I tried to use only your contribution to the original code: the escaped chars were simply removed, but then I changed the regex to `/\\x([\d\w]{2})/gi` and the characters were correctly substituted. Thank you once more! – Fabricio Rocha Jan 12 '17 at 14:07
  • Sorry, I forgot that adding mine introduced TWO groups, lol, so now it's handled correctly. The code above is now corrected to work with both cases. – James Wilkins Jan 12 '17 at 18:58