
I am trying to output unicode text to an RTF file from a python script. For background, Wikipedia says

For a Unicode escape the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode UTF-16 code unit number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter bāʼ ب, specifying that older programs which do not have Unicode support should render it as a question mark instead.

There is also this question on outputting RTF from Java and this one on doing so in C#.

However, what I can't figure out is how to output the unicode code point as a "16-bit signed decimal integer with the Unicode UTF-16 code unit number" from Python. I've tried this:

for char in unicode_string:
    print '\\' + 'u' + str(ord(char)) + '?',

but the output renders as gibberish when opened in a word processor. The problem appears to be that these are not the UTF-16 code unit numbers, but I am not sure how to get those: one can encode in UTF-16, but how does one get the code number?

Incidentally PyRTF does not support unicode (it's listed as a "todo"), and while pyrtf-NG is supposed to do so, that project does not appear to be maintained and has little documentation, so I am wary of using it in a quasi-production system.

Edit: My mistake. There are two bugs in the above code: as Wooble points out below, the string has to be a unicode string, not an already encoded one, and the above code produces a result with spaces between characters. The correct code is this:

convertstring=""
for char in unicode(<my_encoded_string>,'utf-8'):
    convertstring = convertstring + '\\' + 'u' + str(ord(char)) + '?'

This works fine, at least with OpenOffice. I am leaving this here as a reference for others (one mistake further corrected after discussion below).
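
To illustrate the first bug, here is a small Python 3 sketch (the code above is Python 2, so this is an adaptation rather than the original script). Iterating over UTF-8 bytes yields one number per byte, while iterating over the decoded text yields one code point per character. The variable names are chosen for illustration only.

```python
# Python 3 sketch: compare per-byte values with per-character code points.
text = "\u0628"                      # Arabic letter beh, the Wikipedia example
utf8_bytes = text.encode("utf-8")    # the "already encoded" form

byte_values = list(utf8_bytes)            # one integer per UTF-8 byte
code_points = [ord(ch) for ch in text]    # one integer per character

print(byte_values)   # [216, 168]
print(code_points)   # [1576]
```

Writing `\u216?\u168?` into the RTF file produces two unrelated characters, which matches the gibberish described above; `\u1576?` is what the word processor expects.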

ShankarG
  • 1
    ShankarG: the actual spec from Microsoft doesn't use Wikipedia's "16-bit signed decimal integer" wording (which is good, because unlike the WP editor, MS's people know that there are no negative unicode codepoints and mentioning that it's signed would be dumb). All you need to take from it is that you want `\u` followed by a number up to 32767. – Wooble Mar 28 '12 at 13:55
  • Thanks for the clarification. But how do I get the correct number? The output of ord() doesn't seem to be the correct one. – ShankarG Mar 28 '12 at 15:26
  • 1
    `ord()` seems to be producing `1576` for me. Are you sure what you have is a unicode string and not utf-8 bytes? – Wooble Mar 28 '12 at 15:27
  • Thanks, that was in fact an issue. I had implicitly assumed that ord's output would be the same either way, but obviously that's not true. However, there's still a problem, in that I'm trying to use Devnagari text (i.e. Hindi) and Devnagari characters are often multi-byte in nature - they are rendering in the rtf text as separate characters rather than correctly. So it still seems like the unicode numbering might be wrong? – ShankarG Mar 28 '12 at 15:39
  • Are you using a Unicode string or a UTF-8 encoded byte string? Show us an example of the `repr` of an actual string you're trying to output. – Mark Ransom Mar 28 '12 at 16:46
  • No, it's my mistake; the problem was that 1) the string was encoded, as noted above, and 2) the code I've included above produces a result with spaces in it, which was confusing the parser. Am editing with the correct code now. – ShankarG Mar 28 '12 at 16:58
  • I still see a problem in your latest code sample - if you really have an encoded string, you should use `my_encoded_string.decode('utf8')` rather than `unicode(my_encoded_string)`. Also this will convert every character, even if it's ASCII. – Mark Ransom Mar 28 '12 at 17:06
  • @Wooble: RTF uses signed int throughout; the RTF documents (at least the more recent ones) do mention that codepoints beyond 32767 are to be adjusted (subtract 65536). This applies both to RTF control codes as a whole and to the `\u` control code in particular. – Martijn Pieters Jun 06 '12 at 20:24

2 Answers

3

Based on the information in your latest edit, I think this function will work properly, though see the improved version below.

def rtf_encode(unistr):
    return ''.join([c if ord(c) < 128 else u'\\u' + unicode(ord(c)) + u'?' for c in unistr])

>>> test_unicode = u'\xa92012'
>>> print test_unicode
©2012
>>> test_utf8 = test_unicode.encode('utf-8')
>>> print test_utf8
©2012
>>> print rtf_encode(test_utf8.decode('utf-8'))
\u169?2012

Here's another version that's broken down a little to be easier to understand. I also made it consistent in returning an ASCII string rather than keeping Unicode and flubbing it at the join. It also incorporates a fix based on the comments.

def rtf_encode_char(unichar):
    code = ord(unichar)
    if code < 128:
        return str(unichar)
    return '\\u' + str(code if code <= 32767 else code-65536) + '?'

def rtf_encode(unistr):
    return ''.join(rtf_encode_char(c) for c in unistr)
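
One case the code above does not cover is a character outside the Basic Multilingual Plane when `ord()` returns a value above 65535 (always the case in Python 3, and in wide Python 2 builds). The following is a Python 3 sketch, not part of the original answer, that takes the question's wording literally: it encodes each non-ASCII character to UTF-16 and reads every code unit back as a signed 16-bit integer with `struct`. The name `rtf_encode_utf16` is made up for illustration, and ASCII characters are passed through without any further RTF escaping.

```python
import struct

def rtf_encode_utf16(text):
    """Escape non-ASCII text as RTF \\uN? sequences, one per UTF-16 code unit."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # plain ASCII passes through unchanged
            continue
        raw = ch.encode("utf-16-le")  # 2 bytes, or 4 for a surrogate pair
        # "h" is a signed 16-bit integer, so values above 32767 come out negative.
        units = struct.unpack("<%dh" % (len(raw) // 2), raw)
        out.extend("\\u%d?" % unit for unit in units)
    return "".join(out)

print(rtf_encode_utf16("\u00a92012"))   # \u169?2012
print(rtf_encode_utf16("\U0001F600"))   # \u-10179?\u-8704?
```

Letting `struct` do the signed conversion avoids the explicit `code - 65536` arithmetic, at the cost of encoding each character separately.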
Mark Ransom
  • Thanks for this. You're right about my code converting every character rather than only the non-ASCII ones, though in practice that should not affect final output (in an ideal world :) ). Regarding using "decode" instead of "unicode", according to [this](http://effbot.org/zone/unicode-objects.htm) the two have the same functionality, though you are correct in that I should have explicitly specified the encoding by saying `unicode(<my_encoded_string>, 'utf-8')`. – ShankarG Mar 29 '12 at 04:34
  • @ShankarG, I didn't realize the `unicode` function had additional parameters, I just thought it would fail when you gave it non-ASCII. Thanks for informing me. – Mark Ransom Mar 29 '12 at 12:59
  • Actually, this is still incorrect. The RTF standard uses *signed* 16-bit integers, so values over 32767 are represented as negative numbers (subtract 65536). – Martijn Pieters Jun 05 '12 at 10:37
  • @MartijnPieters, I don't know why it took so long for me to see your comment. Hopefully my edit is a complete fix. – Mark Ransom Feb 27 '13 at 21:44
  • @MarkRansom: That certainly looks better. :-) – Martijn Pieters Feb 27 '13 at 21:49
1

Mark Ransom's answer isn't quite correct as it'll not encode codepoints over U+7fff correctly, nor will it escape characters below 0x20 as recommended by the RTF standard.

I've created a simple module that encodes python unicode to RTF control codes called rtfunicode, and wrote about the subject on my blog.

In summary, my method uses a regular expression to map the right codepoints to RTF control codes suitable for inclusion in either PyRTF or pyrtf-ng.
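
As a rough idea of what a regex-based mapping could look like, here is a Python 3 sketch. It is not the code of the rtfunicode module: the pattern, the function name `rtf_escape`, and the choice to also escape backslashes and braces as `\uN?` sequences are assumptions made for illustration.

```python
import re

# Match RTF's special characters, control characters below 0x20,
# DEL, and anything outside 7-bit ASCII. (Hypothetical pattern.)
_NEEDS_ESCAPE = re.compile(r"[\\{}\x00-\x1f\x7f-\U0010ffff]")

def _escape_match(match):
    """Turn one matched character into one \\uN? per UTF-16 code unit."""
    raw = match.group(0).encode("utf-16-le")
    parts = []
    for i in range(0, len(raw), 2):
        # Read each 2-byte code unit as a signed 16-bit integer.
        unit = int.from_bytes(raw[i:i + 2], "little", signed=True)
        parts.append("\\u%d?" % unit)
    return "".join(parts)

def rtf_escape(text):
    """Escape text for inclusion in an RTF document body (sketch only)."""
    return _NEEDS_ESCAPE.sub(_escape_match, text)

print(rtf_escape("\u00a92012 {ok}"))   # \u169?2012 \u123?ok\u125?
```

A single substitution pass keeps printable ASCII untouched and handles every other character in one place, which is the main appeal of the regex approach.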

Martijn Pieters