
PowerShell Core 7 is apparently natively BOM-less UTF-8, yet it still behaves exactly the same as Windows PowerShell when I use javac on any UTF-8 source file that contains accented characters: the .class file ends up encoded with the ANSI code page.

For example, this simple program, PremierProg.java:

public class PremierProg
{
    public static void main( String[] args )
    {
        System.out.println("Je suis déterminé à apprendre comment coder.");
    }
}

will be compiled and then executed in pwsh, with the accented characters garbled instead of the intended output:

Je suis déterminé à apprendre comment coder.

I can obviously add the `-encoding "UTF-8"` option to my javac call, but isn't the point of cross-platform development not having to do any of that? It is actually easier to type `wsl javac [MySource.java]` and have it output the correct .class file. The same versions of OpenJDK are installed on both the Windows and Ubuntu sides.
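For reference, the explicit workaround looks like this (using the example file above):

javac -encoding UTF-8 PremierProg.java
java PremierProg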

PowerShell does read the file correctly as UTF-8:

[Screenshot: pwsh reading the UTF-8 source file correctly]

but it still interacts with javac using ANSI (even though other UTF-8-native shells like bash don't have this issue).
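One way to confirm which default encoding the JDK tools pick up is the standard -XshowSettings JVM flag; the property list goes to stderr, hence the stream merge, and the Select-String filter is just one convenient way to narrow the output in pwsh:

# On JDK 17 and earlier, file.encoding reports the platform default (e.g. Cp1252 on Windows)
java -XshowSettings:properties -version 2>&1 | Select-String 'encoding'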

Does anyone know why PowerShell - even the cross-platform Core version - behaves this way? I really don't want to add anything to a profile.ps1 file or to the javac call. I would just like something that is UTF-8-native to fully behave that way.

At the moment, I am getting my students to run bash (via WSL) as their default shell, and that's fine, but this problem still bothers me. I'd like to understand what is going on and, if the solution is at all reasonable, fix it.

The reason for not wanting a profile.ps1 file or extra parameters in the javac call is that this needs to run on school-board-controlled devices where scripts are disabled, and I am dealing with a majority of first-time programmers.

  • Here's [the documentation](https://docs.oracle.com/en/java/javase/17/docs/specs/man/javac.html#options) for `-encoding`: "_Specifies character encoding used by source files, such as EUC-JP and UTF-8. If the -encoding option is not specified, then the platform default converter is used_". As you can see, if the option is not specified, a default encoding is chosen based on the current platform. Looks like on Windows that default is ANSI, whereas on Linux it is UTF-8. Using PowerShell does not change what platform you're on. However, delegating to WSL means you're on Linux instead of Windows. – Slaw Nov 20 '21 at 19:32
  • I recommend you use `-encoding UTF-8` when compiling your code, even when you're on Linux. That way there's no ambiguity. – Slaw Nov 20 '21 at 19:33
  • @Slaw I guess that means there is no one-time parameter tweak for this issue. I haven't been able to find a global `javac` configuration tool. – David Crowley Nov 20 '21 at 19:36
  • Looks like this may no longer be a problem in Java 18: [JEP 400: UTF-8 by Default](https://openjdk.java.net/jeps/400). Unfortunately, it won't be released until March 2022. – Slaw Nov 20 '21 at 19:39
  • @Slaw I may have to, and then sell it with the fact that the students can just use the arrows to retrieve previous commands instead of having to retype each time. Trust me, command line versus a Run button is already a hard sell. – David Crowley Nov 20 '21 at 19:39
  • @Slaw that documentation is pretty complete. Knowing that the problem was JDK-specific and not console-specific would have helped me Google better before posting this question... Thank you for speeding that process up for me. Your suggestion to simply use `-encoding UTF-8` is indeed the simplest, most global solution at this time. – David Crowley Nov 20 '21 at 20:04

2 Answers


Thanks to @Slaw's comments on the original question, the solution actually has nothing to do with PowerShell or any other console, but with the platform (Windows, macOS, Linux) and the JDK.

Java 18 will no longer default to the platform charset but to UTF-8 ([JEP 400: UTF-8 by Default](https://openjdk.java.net/jeps/400)), so that will ultimately eliminate this issue.

In the meantime, the best solution seems to be adding the `-encoding UTF-8` option to the javac call. Students can save time by using the arrow keys to retrieve the longer command from their history rather than typing it out each time they need to compile. This solution will remain useful even after Java 18 is released, simply because it is clear and explicit, at the slight cost of being longer.


To complement your own answer:

  • Windows 10 offers a still-in-beta option to use UTF-8 system-wide (meaning that both the OEM and the ANSI code page are set to 65001, which is UTF-8). While activating this option has the potential to make encoding headaches go away - not just with javac (the active ANSI code page it uses will then effectively be UTF-8), but also with PowerShell in general (see below) - it also has far-reaching consequences - see this answer.
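A quick way to check whether that option is active is to read the active code pages from the registry; the path and value names below are the standard NLS ones, but treat this as a sketch:

# Both ACP (ANSI) and OEMCP (OEM) report 65001 when system-wide UTF-8 is enabled
Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' |
  Select-Object ACP, OEMCP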

  • If activating system-wide UTF-8 support is not an option, you could work around the problem by defining a wrapper function for javac that hard-codes -encoding utf8 while passing all other arguments through, and placing that in your $PROFILE file so that it becomes available by default in all future sessions:

# Wrapper function: shadows javac.exe and forces UTF-8 source decoding
function javac { javac.exe -encoding utf8 $args }

Note: Functions have higher command precedence than external programs, so when you submit javac, the function is called. If you also wanted the javac.exe form to invoke the wrapper, you could add Set-Alias javac.exe javac and redefine the javac function body as & (Get-Command -Type Application javac.exe) -encoding utf8 $args - see the sketch below.
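Put together, that variant would look like this (a sketch based on the description above):

# Make the javac.exe form resolve to the wrapper function too
Set-Alias javac.exe javac
# The function must now locate the real executable explicitly,
# otherwise it would call itself through the alias
function javac { & (Get-Command -Type Application javac.exe) -encoding utf8 $args }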


Also note that there are PowerShell-specific character-encoding considerations:

  • As of PowerShell (Core) 7.2, PowerShell console windows unfortunately still default to the system's legacy OEM code page, as reflected in the output from chcp and, in .NET, in [Console]::InputEncoding and [Console]::OutputEncoding.

    • If a given external program outputs text in a different encoding - e.g. UTF-8 - [Console]::OutputEncoding must first be set to that encoding in order for PowerShell to decode the output correctly (see the snippet after this list).

    • Caveat: An encoding mismatch may go unnoticed in direct-to-display output, but will surface when PowerShell processes the output, such as sending it through the pipeline, redirecting it to a file, or capturing it in a variable.

  • Conversely, the $OutputEncoding preference variable determines what encoding PowerShell uses to send data to external programs, via the pipeline.

    • Windows PowerShell regrettably defaults to ASCII(!), with any non-ASCII-range characters getting transcoded "lossily" to literal ? chars.

    • PowerShell (Core) 7+ now more sensibly defaults to UTF-8 - although, as stated, when decoding output it still defaults to the system's OEM code page.
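For completeness, here is how those settings can be inspected and, for the current session only, switched to UTF-8 (a sketch; nothing here persists across sessions):

# Inspect the current settings (pwsh)
[Console]::InputEncoding    # console input encoding
[Console]::OutputEncoding   # how PowerShell decodes external-program output
$OutputEncoding             # how PowerShell encodes data piped to external programs

# Session-only switch of all three to (BOM-less) UTF-8
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()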

See this answer for a more detailed discussion of PowerShell's encoding behavior and links to helper functions.

– mklement0
  • Scripts (including $PROFILE) are disabled for the stock Windows PowerShell in our organisation, and I have to manually authorise the installation of PowerShell Core on each student machine to get around that (no restrictions were set up for that shell's scripts). I have tried all the PowerShell encoding tricks outside of a script that I have found, and none work more efficiently than just telling `javac` what to do and then using the arrow keys to repeat the command later on. We also have the ability to "not use Windows" by using our WSL Ubuntu shell for all the `javac` and `java` work. – David Crowley Nov 22 '21 at 17:10
  • @DavidCrowley I see. So that means profiles _are_ enabled for the PowerShell _Core_ installation? – mklement0 Nov 22 '21 at 17:12
  • exactly... the school board didn't think of restricting a shell that wasn't installed. It worked on my machine at any rate (no errors starting PS Core 7 with $PROFILE, but starting PS would tell me that scripts were disabled). Since it creates at least as many headaches as it solves, I will avoid that solution for my students (although I would use PS Core myself). – David Crowley Nov 22 '21 at 17:17
  • @DavidCrowley, that leaves switching to UTF-8 _system-wide_ as the only no-extra-effort solution, though I can see why that may not be an option. – mklement0 Nov 22 '21 at 17:20
  • I do give them a *system-wide* UTF-8 option via setting Ubuntu (WSL) as the default shell, and most students go with that. I think I prefer the second option I give them: going with the `-encoding utf-8` option in the compile command for its explicitness. It is a tiny bit more work, so I expect it to be less popular. This is intro to programming in high school, so I am not going to make a huge fuss as long as they get it to work consistently. – David Crowley Nov 22 '21 at 21:09