5

We are building a Python 3 program which calls a Java program. The Java program (which is a 3rd party program we cannot modify) is used to tokenize strings (find the words) and provide other annotations. Those annotations are in the form of character offsets.

As an example, we might provide the program with string data such as "lovely weather today". It provides something like the following output:

0,6
7,14
15,20

Where 0,6 are the offsets corresponding to word "lovely", 7,14 are the offsets corresponding to the word "weather" and 15,20 are offsets corresponding to the word "today" within the source string. We read these offsets in Python to extract the text at those points and perform further processing.
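On the Python side that extraction is just slicing the string with the reported offsets, roughly like this:

text = "lovely weather today"
print(text[0:6])    # lovely
print(text[7:14])   # weather
print(text[15:20])  # today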

All is well and good as long as the characters are within the Basic Multilingual Plane (BMP). However, when they are not, the offsets reported by this Java program show up all wrong on the Python side.

For example, given the string "I feel 🙂 today", the Java program will output:

0,1
2,6
7,9
10,15

On the Python side, these translate to:

0,1    "I"
2,6    "feel"
7,9    "🙂 "
10,15  "oday"

Where the last index is technically invalid. Java sees "🙂" as length 2, which causes all the annotations after that point to be off by one from the Python program's perspective.

Presumably this occurs because Java encodes strings internally in a UTF-16-like way, and all string operations act on those UTF-16 code units. Python strings, on the other hand, operate on the actual Unicode characters (code points). So when a character falls outside the BMP, the Java program sees it as length 2, whereas Python sees it as length 1.
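The difference is easy to see from the Python side alone; for a character outside the BMP (using 🙂 here as an illustration):

ch = "\U0001F642"                       # 🙂, outside the BMP
print(len(ch))                          # 1 -- Python counts code points
print(len(ch.encode("UTF-16LE")) // 2)  # 2 -- Java counts UTF-16 code units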

So now the question is: what is the best way to "correct" those offsets before Python uses them, so that the annotation substrings are consistent with what the Java program intended to output?

NanoWizard
  • Could you be more explicit about what you are actually seeing as output? Those numbers you give are not the correct Unicode code points. – markspace May 23 '19 at 17:10
  • I didn't provide any unicode code points. Those are character offsets for the string passed to the Java program. I'll try to make that more clear in the text. – NanoWizard May 23 '19 at 17:12
  • Then we'll need the actual data. Not more discussion. We need to see what the program is actually returning as output before we have any chance of guessing how we might read it. – markspace May 23 '19 at 17:13
  • I'm not quite following. I've provided an example of the Java program output. If you want the exact byte sequence (in hex) sent to the Java program for the example text "I feel today" it is `492066656c20f09f998220746f646179` in the form of a UTF-8 encoded file that the Java program reads. – NanoWizard May 23 '19 at 17:29
  • This isn’t Java’s fault. It’s the fault of whoever wrote that Java program. Those person(s) wrongly assumed all characters are BMP characters. There are standard ways in Java to traverse Strings by Unicode codepoints instead of UTF-16 chars. I recommend letting them know of their mistake. – VGR May 23 '19 at 17:36
  • @VGR yes absolutely we will be informing the developers of this oversight. But we need an immediate solution until their software is fixed. – NanoWizard May 23 '19 at 17:42

2 Answers

5

You could convert the string to a bytearray using a UTF-16 encoding, then use the offsets (multiplied by 2, since each UTF-16 code unit is two bytes) to index that array:

x = "I feel  today"
y = bytearray(x, "UTF-16LE")

offsets = [(0,1),(2,6),(7,9),(10,15)]

for word in offsets:
  print(str(y[word[0]*2:word[1]*2], 'UTF-16LE'))

Output:

I
feel
🙂
today

Alternatively, you could convert every Python character in the string individually to UTF-16 and count the number of code units it takes. This lets you map indices in terms of code units (from Java) to indices in terms of Python characters:

from itertools import accumulate

x = "I feel  today"
utf16offsets = [(0,1),(2,6),(7,9),(10,15)] # from java program

# map python string indices to an index in terms of utf-16 code units
chrLengths = [len(bytearray(ch, "UTF-16LE"))//2 for ch in x]
utf16indices = [0] + list(accumulate(chrLengths))
# reverse the map so that it maps utf-16 indices to python indices
index_map = dict((u, i) for i, u in enumerate(utf16indices))

# convert the offsets from utf16 code-unit indices to python string indices
offsets = [(index_map[o[0]], index_map[o[1]]) for o in utf16offsets]

# now you can just use those indices as normal
for word in offsets:
  print(x[word[0]:word[1]])

Output:

I
feel
🙂
today

The above code is messy and can probably be made clearer, but you get the idea.
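For instance, the same mapping could be packaged into a small helper along these lines (just a sketch; the helper name is illustrative):

from itertools import accumulate

def utf16_to_python_offsets(text, utf16_offsets):
    # number of UTF-16 code units each Python character occupies (1 or 2)
    lengths = [len(ch.encode("UTF-16LE")) // 2 for ch in text]
    # starts[i] is the UTF-16 offset where the i-th Python character begins
    starts = [0] + list(accumulate(lengths))
    index_map = {u: i for i, u in enumerate(starts)}
    return [(index_map[a], index_map[b]) for a, b in utf16_offsets]

x = "I feel 🙂 today"
print([x[a:b] for a, b in utf16_to_python_offsets(x, [(0,1),(2,6),(7,9),(10,15)])])
# ['I', 'feel', '🙂', 'today']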

Blorgbeard
  • I initially thought this was great! However, it forces us to refactor all of our existing python code to operate on those bytes instead of strings... so while that's possible I'd reaaaaaallly like to avoid it... If there was a way to use this to remap those offsets from the UTF-16 offsets to the python unicode string offsets, that would be preferable... – NanoWizard May 23 '19 at 17:36
  • Perfect! And probably much more efficient than my solution using for loops. Thank you! – NanoWizard May 23 '19 at 18:25
1

This solves the problem, given the proper encoding, which in our situation appears to be 'UTF-16BE':

def correct_offsets(text, offsets, encoding):
  offset_list = [{'old': o, 'new': [o[0], o[1]]} for o in offsets]

  utf16_idx = 0  # position of the current character in UTF-16 code units
  for ch in text:
    if len(ch.encode(encoding)) > 2:
      # character outside the BMP: it occupies two UTF-16 code units,
      # so every offset beyond this point must be pulled back by one
      for o in offset_list:
        if o['old'][0] > utf16_idx:
          o['new'][0] -= 1
        if o['old'][1] > utf16_idx:
          o['new'][1] -= 1
      utf16_idx += 2
    else:
      utf16_idx += 1

  return [o['new'] for o in offset_list]

This may be pretty inefficient though. I gladly welcome any performance improvements.
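One possible improvement (my own sketch, essentially the same mapping idea as the other answer): precompute the UTF-16-to-code-point index map once, so each offset is corrected with a dictionary lookup instead of rescanning the string for every non-BMP character:

from itertools import accumulate

def correct_offsets_fast(text, offsets, encoding):
  # UTF-16 code units occupied by each character (1 for BMP, 2 otherwise)
  units = [len(ch.encode(encoding)) // 2 for ch in text]
  # starts[i] is the UTF-16 offset at which the i-th Python character begins
  starts = [0] + list(accumulate(units))
  index_map = {u: i for i, u in enumerate(starts)}
  return [[index_map[o[0]], index_map[o[1]]] for o in offsets]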

NanoWizard