
I have an ARFF file with the following attributes:

@ATTRIBUTE "åäö" NUMERIC
@ATTRIBUTE "åøã" NUMERIC

The file is saved as UTF-8. I read it in my Java application using the Weka API, and the program runs without any issue from Eclipse.

However, when I run the program from PowerShell or the command prompt (simply using java -jar my-app.jar -data path/to/mydata.arff), I get the following error:

java.io.IOException: Unable to determine structure as arff (Reason: java.lang.IllegalArgumentException: Attribute names are not unique! Causes: 'å??' ).

at weka.core.converters.ArffLoader.getStructure(ArffLoader.java:1204)

at weka.core.converters.ArffLoader.getDataSet(ArffLoader.java:1234)

at weka.core.converters.ConverterUtils$DataSource.getDataSet(ConverterUtils.java:269)

I tried changing the console encoding (the default is OEM United States, IBM437) as follows.

Attempt 1: Set UTF-8 encoding in my .ps1 script (source):

$OutputEncoding = New-Object -typename System.Text.UTF8Encoding
[Console]::OutputEncoding = New-Object -typename System.Text.UTF8Encoding

This didn't help; it only changed the console output from ...Causes: 'å??'... to ...Causes: '�??'....

Attempt 2: Change the encoding directly in the console (source):

$OutputEncoding = [Console]::OutputEncoding

This didn't work either.

Is there any way to fix this?

Update: This question is not a duplicate of Printing Unicode characters to the PowerShell prompt. In my case it does not matter whether the right character is displayed on the command prompt, because my program does not attempt to print it. Also, please note that the answer to that question (using [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(850)) produced exactly the same result, so it does not solve this problem. Running the program from PowerShell ISE and ConEmu didn't help either.

I assume that if the correct encoding can be set for the 'session' (or environment/context; I am not sure what to call it), that would be enough for my program to process the ARFF file correctly. However, I am not sure how to do that.

Sayan Pal
    I thought your question was about displaying the `Causes: 'å??'` message correctly - but if you don't care about that, why are you changing the PowerShell output encoding? You pass a filename to Java, and Java doesn't read the file content as UTF-8 properly, that's not anything to do with the shell, the console character encoding, the PowerShell output format, etc. If it finds the file by name, the shell part is over. It seems like it has to be down to the particular version of Java.exe you call or the environment variables set by Eclipse being different from the defaults, maybe? – TessellatingHeckler May 26 '17 at 20:59
  • @TessellatingHeckler Thank you for the comment. I solved this by setting the JVM's options. – Sayan Pal May 26 '17 at 22:43

1 Answer


Following @TessellatingHeckler's comment, I solved this by setting the JVM's encoding option: I added a system environment variable named JAVA_TOOL_OPTIONS and set its value to -Dfile.encoding=utf-8 (source: https://stackoverflow.com/a/24265723/2270340).

Now, every time I start Java, the following message confirms that the option has been picked up:

Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf-8
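To check from inside the application which charset the JVM actually uses, a small diagnostic along these lines may help. It is only a sketch: the class name EncodingCheck and the helper decodeUtf8BytesAs are made up for illustration, and ISO-8859-1 merely stands in for "some non-UTF-8 default", not necessarily the one on my machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    // Encodes `text` as UTF-8 and decodes the bytes with `charset`, mimicking
    // a reader that assumes the given encoding for a UTF-8 file.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, charset);
    }

    public static void main(String[] args) {
        // What the JVM falls back to when no charset is given explicitly.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        String name = "\u00e5\u00e4\u00f6"; // the attribute name "åäö"
        // Decoding with the matching charset round-trips the name...
        System.out.println(decodeUtf8BytesAs(name, StandardCharsets.UTF_8).equals(name));
        // ...while a mismatched charset (assumed stand-in) yields different text.
        System.out.println(decodeUtf8BytesAs(name, StandardCharsets.ISO_8859_1).equals(name));
    }
}
```

Running this once from Eclipse and once from the console should show whether the two environments start the JVM with different defaults. Passing -Dfile.encoding=UTF-8 directly on the java command line should have the same effect for a single run, without a system-wide variable.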

I am posting this answer to share my findings. If there is a better way to do this, please post an answer/comment.
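One possible alternative that avoids a machine-wide setting is to open the file with an explicit UTF-8 reader, so the result no longer depends on the JVM default. The sketch below uses only the standard library; Utf8ArffHeader, attributeLines and writeDemoFile are hypothetical names for illustration. As far as I know, Weka's ArffLoader.ArffReader has a constructor that takes a java.io.Reader, so the same reader could probably be handed to it, but I have not verified that against my setup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Utf8ArffHeader {
    private static final String KEYWORD = "@ATTRIBUTE";

    // Returns the @ATTRIBUTE lines of an ARFF file, always decoding it as
    // UTF-8 regardless of the JVM's default charset.
    static List<String> attributeLines(Path arffFile) {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(arffFile, StandardCharsets.UTF_8)) {
            // Assumption: with Weka on the classpath, `reader` could be passed
            // to new ArffLoader.ArffReader(reader) instead of this manual scan.
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                // Case-insensitive prefix match on the ARFF keyword.
                if (trimmed.regionMatches(true, 0, KEYWORD, 0, KEYWORD.length())) {
                    result.add(trimmed);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    // Writes a tiny UTF-8 ARFF file for the demo and returns its path.
    static Path writeDemoFile() {
        try {
            Path file = Files.createTempFile("demo", ".arff");
            Files.write(file, Arrays.asList(
                    "@RELATION demo",
                    "@ATTRIBUTE \"\u00e5\u00e4\u00f6\" NUMERIC",
                    "@ATTRIBUTE \"\u00e5\u00f8\u00e3\" NUMERIC",
                    "@DATA",
                    "1,2"), StandardCharsets.UTF_8);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> attributes = attributeLines(writeDemoFile());
        System.out.println("attributes found: " + attributes.size());
        System.out.println("names collide: " + attributes.get(0).equals(attributes.get(1)));
    }
}
```

The trade-off is that this requires a code change, whereas JAVA_TOOL_OPTIONS affects every JVM started on the machine.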

Sayan Pal