I am trying to pipe some Unicode characters from Python to Java.
Python code:
import subprocess

thai = u"ฉันจะกลับบ้านในคืนนี้"
# Build the command line; the Thai text is appended as the argument
command = "java -jar tokenizer.jar " + thai
p = subprocess.Popen(command, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.PIPE)
I plan to pass them to Java via args[].
The tokenizer's results were different when I ran it in Java like this:
public static void main(String[] args)
{
    // Thai text hard-coded as a string literal
    String thai = "ฉันจะกลับบ้านในคืนนี้";
    ThaiAnalyzer ana = new ThaiAnalyzer();
    ana.analyze(thai);
}
vs
public static void main(String[] args)
{
    // "ฉันจะกลับบ้านในคืนนี้" (this string should be passed from Python)
    String thai = args[0];
    ThaiAnalyzer ana = new ThaiAnalyzer();
    ana.analyze(thai);
}
I believe it to be an encoding issue.
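To test the encoding theory, the string's code points can be dumped on the Python side and compared with what Java receives in args[0]. The following is only a minimal sketch, assuming Python 3. The helper names code_points and survives_round_trip are hypothetical, and cp1252 is just an example of a legacy code page that has no Thai characters, not necessarily the one in use here.

```python
# -*- coding: utf-8 -*-
import sys

thai = u"ฉันจะกลับบ้านในคืนนี้"


def code_points(text):
    """Return the string as a list of U+XXXX code point labels (hypothetical helper)."""
    return ["U+%04X" % ord(ch) for ch in text]


def survives_round_trip(text, encoding):
    """Return True if text can be encoded and decoded in `encoding` without loss."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


# ASCII-only output, so it prints safely on any console
print(code_points(thai))
print("utf-8 round trip: ", survives_round_trip(thai, "utf-8"))
print("cp1252 round trip:", survives_round_trip(thai, "cp1252"))
# Encoding Python reports for file names on this machine, for comparison
print("filesystem encoding:", sys.getfilesystemencoding())
```

If a similar dump of args[0] on the Java side (for example with String.codePointAt) shows different values, the string was altered before the tokenizer ever saw it.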
Pardon the abbreviated Java code; I do not have the actual code with me right now.
What I am trying to say is this. Suppose I pass the following string from Python to Java to be tokenized:
"Hi i am going home"
With the former method (the hard-coded string), I might end up with
"Hi", "i", "am", "going", "home"
whereas the latter method (the string taken from args[0]) might yield something like
"Hi i", "am", "going home"
My question is about this difference in the output. I am using English here only to illustrate the problem.
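One alternative that might sidestep the command-line encoding altogether is to send the text over stdin as explicit UTF-8 bytes instead of through args[]. This is only a sketch under assumptions: it needs Python 3, run_with_utf8_stdin is a hypothetical helper, and tokenizer.jar would have to be changed to read System.in through a UTF-8 reader, which it does not do at present. A small Python child process stands in for Java so that the snippet runs on its own.

```python
import subprocess
import sys


def run_with_utf8_stdin(command, text):
    """Run `command`, send `text` on stdin as UTF-8, return stdout decoded as UTF-8."""
    p = subprocess.Popen(command, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _err = p.communicate(text.encode("utf-8"))
    return out.decode("utf-8")


# Stand-in child that echoes its stdin bytes unchanged. The real command
# would be something like ["java", "-jar", "tokenizer.jar"] (assumption).
echo = [sys.executable, "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]

thai = u"ฉันจะกลับบ้านในคืนนี้"
# Check that the text comes back from the child process unchanged
print(run_with_utf8_stdin(echo, thai) == thai)
```

Because the bytes are encoded explicitly on the Python side, the result no longer depends on whatever encoding the operating system applies to command-line arguments.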