We are building a Python 3 program which calls a Java program. The Java program (which is a 3rd party program we cannot modify) is used to tokenize strings (find the words) and provide other annotations. Those annotations are in the form of character offsets.
As an example, we might provide the program with string data such as "lovely weather today"
. It provides something like the following output:
0,6
7,14
15,20
Where 0,6
are the offsets corresponding to word "lovely", 7,14
are the offsets corresponding to the word "weather" and 15,20
are offsets corresponding to the word "today" within the source string. We read these offsets in Python to extract the text at those points and perform further processing.
All is well and good as long as the characters are within the Basic Multilingual Plane (BMP). However, when they are not, the offsets reported by this Java program show up all wrong on the Python side.
For example, given the string "I feel today"
, the Java program will output:
0,1
2,6
7,9
10,15
On the Python side, these translate to:
0,1 "I"
2,6 "feel"
7,9 " "
10,15 "oday"
Where the last index is technically invalid. Java sees "" as length 2, which causes all the annotations after that point to be off by one from the Python program's perspective.
Presumably this occurs because Java encodes strings internally in a UTF-16esqe way, and all string operations act on those UTF-16esque code units. Python strings, on the other hand, appear to operate on the actual unicode characters (code points). So when a character shows up outside the BMP, the Java program sees it as length 2, whereas Python sees it as length 1.
So now the question is: what is the best way to "correct" those offsets before Python uses them, so that the annotation substrings are consistent with what the Java program intended to output?