
I am currently developing an application that works with Unicode text.

The Unicode characters have to be read in Python to determine the language before being passed on to Java for processing. Currently, I read the file in Python first to determine the language and then call the corresponding Java engine to process it.

This approach takes too long because of the I/O cost involved, but passing the Unicode characters directly as a command-line argument does not work; it throws:

'charmap' codec can't encode characters in position xx - xx: character maps to <undefined>

What I would like to do (excerpt of my code):

import subprocess
from subprocess import PIPE

# reads in the unicode text
text = u"some unicode words"
command = u"java -jar unicodeProcessor.jar " + text
subprocess.Popen(command, stdout=PIPE, stderr=PIPE)

Java processes it and writes it to a file.
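A minimal sketch of what this argument-passing variant could look like, assuming Python 3 (where `subprocess` accepts `str` arguments directly and builds the command line itself when given a list) and assuming the jar takes the text as its first argument:

import subprocess

text = "some unicode words"
# Pass the program and its argument as a list; no command string is
# concatenated and no shell quoting is involved.
proc = subprocess.Popen(
    ["java", "-jar", "unicodeProcessor.jar", text],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()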

Currently:

# determines what the language is
filepath = "filepath of text file"
command = "java -jar unicodeProcessor.jar " + filepath
subprocess.Popen(command, stdout=PIPE, stderr=PIPE)
# in this method the argument is a file path instead of the string itself

This method is too slow.

Current code:

import subprocess
from subprocess import PIPE

unic = open("unicode_words.txt")
words = unic.read()
unic.close()
if isinstance(words, str):
    # Python 2: decode the raw bytes as UTF-8
    convert = unicode(words, 'utf-8')
else:
    convert = words

command = u"java -jar unicodeProcessor.jar " + convert
subprocess.Popen(command, stdout=PIPE, stderr=PIPE)
aceminer
  • What is your question? Explain "does not work." – Lutz Horn Mar 02 '15 at 09:23
  • @Lutz Horn updated my question – aceminer Mar 02 '15 at 09:26
  • So what codec does your Java application expect? You cannot just write Unicode strings to it; implicit encoding failed. – Martijn Pieters Mar 02 '15 at 09:27
  • Where do you use `charmap`? Show us the input that gives this error and the code that throws it. – Lutz Horn Mar 02 '15 at 09:27
  • @MartijnPieters What do you mean by that? I can encode them in UTF-8; before I passed them to stdout, I had already converted them to UTF-8. – aceminer Mar 02 '15 at 09:28
  • @Lutz Horn I do not use charmap; it just throws up that error in IPython. – aceminer Mar 02 '15 at 09:28
  • @aceminer: you ran into an implicit encoding issue, but you haven't shared the code that throws the exception, so there is little we can do to help there. – Martijn Pieters Mar 02 '15 at 09:30
  • @aceminer: passing them to the subprocess stdin should just work, provided you encode the data correctly. But without code we cannot help you here. – Martijn Pieters Mar 02 '15 at 09:30
  • @MartijnPieters This is my code. – aceminer Mar 02 '15 at 09:34
  • @aceminer: and the full traceback is? – Martijn Pieters Mar 02 '15 at 09:35
  • can your `java` process read the input from stdin? `with open('unicode_words.txt', 'rb', 0) as file: out = subprocess.check_output(['java', '-jar', 'unicodeProcessor.jar'], stdin=file, stderr=subprocess.STDOUT)`? – jfs Mar 02 '15 at 09:40
  • does `convert.encode('mbcs')` work? If not; can you upgrade to Python 3? – jfs Mar 02 '15 at 09:42
  • Java bases its default encoding on the locale; better to make it explicit by adding `-Dfile.encoding="utf-8"` to your java subprocess call... – swenzel Mar 02 '15 at 09:53
  • related: [Unicode filename to python subprocess.call()](http://stackoverflow.com/q/2595448/4279), see Python issue [subprocess.Popen doesn't support unicode on Windows](http://bugs.python.org/issue19264) – jfs Mar 02 '15 at 10:03
  • @J.F.Sebastian Yes, my Java code can read input from stdin; in fact, I can modify both the Java and Python code. However, my Java code does not seem to run when the Unicode text is piped in; it simply hangs there. If I pass the Unicode characters as a file, there is no issue. – aceminer Mar 03 '15 at 02:12
  • @aceminer: [using standard streams to pass data uncorrupted can be a challenge on Windows](http://stackoverflow.com/q/8669056/4279). If input is not large; you could pass it as json text (`json.dumps(convert)` produces ascii-only string by default). – jfs Mar 03 '15 at 09:50
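Building on the stdin and `-Dfile.encoding` suggestions in the comments above, a hedged sketch of feeding the file to the Java process on standard input (assuming `unicodeProcessor.jar` reads its text from stdin, as the asker says it can); no codec is applied to the command-line arguments at all:

import subprocess

# Feed the raw bytes of the file to the Java process on stdin and pin
# the JVM's default charset so both sides agree on UTF-8.
with open("unicode_words.txt", "rb", 0) as infile:
    out = subprocess.check_output(
        ["java", "-Dfile.encoding=UTF-8", "-jar", "unicodeProcessor.jar"],
        stdin=infile,
        stderr=subprocess.STDOUT,
    )

If piping still hangs, as reported in the last comments, the ASCII-only alternative mentioned above is to pass `json.dumps(convert)` (ASCII-only by default) on the command line and parse the JSON on the Java side.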

0 Answers