Is there a way, using an install script, a Windows batch file, or PowerShell, to check whether a file is already UTF-8 before passing it for conversion?

As background, I am currently working on a legacy (Japanese) Windows application written in C++, developed with Visual Studio 2005 and since upgraded to Visual Studio 2017.

I am dealing with a requirement to make the GUI able to display and accept input of Chinese characters, hence the decision to use UNICODE for the project/solution encoding.

Since the project originally used Multibyte, I decided, in order to stay backwards compatible under UNICODE, to encode the configuration files (ini, dat, save files) in UTF-8, as these files are also referenced by a web application.

The main parts of the software are now done and working, and I am left with one last problem: rolling out an upgrade ("version up") installer.

In this installer (using an install script), I am required to convert the save files to UTF-8. They were previously encoded in Shift-JIS because they contain Japanese text.

I have already created the following batch file, which converts Shift-JIS to UTF-8. It is called in the last part of the installer and is deleted after the conversion.

@echo off
:: Shift_JIS -> UTF-8
setlocal enabledelayedexpansion
for %%f in ("%~dp0\savedfiles\*.sav") do (
    echo %%~ff| findstr /l /e /i ".sav"
      if !ERRORLEVEL! equ 0 (
        powershell -nop -c "&{[IO.File]::WriteAllText($args[1], [IO.File]::ReadAllText($args[0], [Text.Encoding]::GetEncoding(932)))}" \"%%~ff"  \"%%~ff" 
      )
)

However, the problem is that when the user (1) upgrades, (2) uninstalls (.sav files are left behind on purpose) and (3) re-installs the software, the save files are re-encoded a second time, which results in the software crashing. (Japanese characters converted to UTF-8 during the (1) upgrade become garbage characters after the (3) re-installation.)

phuclv
Wolf
  • There's nothing special about a collection of bytes that's supposed to represent text that indicates what encoding it uses. You must keep track of this some other way. – Sam Varshavchik Sep 28 '20 at 01:31
  • For an answer, I recommend looking at [A: How to detect UTF-8 in plain C?](https://stackoverflow.com/a/22166804), in particular the part that suggests adding an identifier to the start of files. In your case, you would need to select something that normally would not begin your .sav files. If you're thinking ahead, you might select a byte sequence that says "This file has a header" followed by a byte or byte sequence that says "I am UTF-8" -- because future changes might warrant more identifiers. (Should that be a duplicate? Knowing that the alternative is Japanese might spark other answers.) – JaMiT Sep 28 '20 at 01:54
  • Thanks @JaMiT, I have updated the question. I had actually initially created a similar simple .exe file that checks for a UTF-8 BOM and returns true or false, but unfortunately the BOM causes display problems in the web application part. Upon investigation, I thought this article described the cause: https://www.w3.org/International/questions/qa-utf8-bom.en.html. I'll explore the other solutions in the thread, this helps a lot. – Wolf Sep 28 '20 at 02:03
  • @Wolf Yes, you would have to also update the application code that reads the data to account for the header. – JaMiT Sep 28 '20 at 03:12
  • Some off-topic hints: `%~dp0` expands to a folder path that always ends with a backslash. For that reason, never concatenate `%~dp0` with an additional ``\`` and a file/folder name or wildcard pattern, as this results in two backslashes in the complete argument string, which Windows has to correct later. Do not use `f` as loop variable, although it is possible, especially when also using a modifier like `~f`. There are enough other characters which are not modifiers available for use as loop variable. Delayed expansion is not necessary when using `if not errorlevel 1` instead of `if !ERRORLEVEL! equ 0`. – Mofi Sep 28 '20 at 05:28
  • So the command line `setlocal enabledelayedexpansion` can be replaced by `setlocal EnableExtensions DisableDelayedExpansion` when `if not errorlevel 1` is used inside the `for` loop. Fully qualified file names containing one or more `!` anywhere are then processed correctly too, which is not the case with delayed expansion enabled. Delayed expansion is needed here only because the `if` condition does not use the recommended syntax described by the help of command __IF__, output on running `if /?` in a command prompt window. BTW: `if not errorlevel 1` means IF ERRORLEVEL LESS THAN 1, which is in general IF ERRORLEVEL EQUAL 0. – Mofi Sep 28 '20 at 05:34
  • why does the BOM cause *display problems at the web application part*? Any web browser or web engine should handle the BOM without problem – phuclv Sep 28 '20 at 06:18
  • check this - https://github.com/npocmaka/batch.scripts/blob/master/fileUtils/encodingDetect.bat – npocmaka Sep 28 '20 at 06:25

1 Answer

If you're upgrading, all the current files should be in Shift-JIS. Even if some situations leave Shift-JIS and UTF-8 files side by side, there are only two encodings that you need to handle. You can therefore work around the problem with a simple rule: if a file is not valid UTF-8, treat it as Shift-JIS. This is still subject to incorrect detection in some rare cases (a Shift-JIS file whose bytes also happen to form valid UTF-8), but otherwise it might be good enough for your use case.
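To make the rule concrete, here is a minimal sketch in Python, used only for illustration (the installer itself would stay with PowerShell). The function name `detect_encoding` and the sample strings are assumptions made for this example. It also shows one rare false positive: two half-width katakana characters whose Shift-JIS bytes are also a valid UTF-8 sequence.

```python
def detect_encoding(data: bytes) -> str:
    """Return 'utf-8' if data decodes strictly as UTF-8, otherwise 'cp932'.

    Hypothetical helper: it only distinguishes the two encodings discussed
    here and does not verify that the bytes really are valid Shift-JIS.
    """
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp932"  # code page 932, the Windows variant of Shift-JIS


sjis = "日本語".encode("cp932")   # bytes 93 FA 96 7B 8C EA
utf8 = "日本語".encode("utf-8")

print(detect_encoding(sjis))      # cp932
print(detect_encoding(utf8))      # utf-8

# Rare false positive: half-width katakana U+FF82 U+FF71 are the bytes
# C2 B1 in Shift-JIS, and C2 B1 is also the UTF-8 encoding of U+00B1.
ambiguous = "\uff82\uff71".encode("cp932")
print(detect_encoding(ambiguous))  # utf-8, although the bytes came from Shift-JIS
```

Typical Japanese text is not affected, because most Shift-JIS kanji and kana start with a lead byte in the range 0x81 to 0x9F, which can never begin a UTF-8 sequence. Pure ASCII files are identical in both encodings, so they are safe either way.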

By default, a best-fit or replacement fallback handler is used when reading text files, so invalid bytes are replaced silently instead of raising an error. We can switch to an exception fallback, so that an exception is thrown if a Shift-JIS file is opened as UTF-8:

try {
    $t = [IO.File]::ReadAllText($f, [Text.Encoding]::GetEncoding(65001, `
         (New-Object Text.EncoderExceptionFallback), `
         (New-Object Text.DecoderExceptionFallback)))
} catch {
    # File is not UTF-8, reopen as Shift-JIS
    $t = [IO.File]::ReadAllText($f, [Text.Encoding]::GetEncoding(932))
}

# Write the file as UTF-8
[IO.File]::WriteAllText($f, $t)

It's better to loop through the files and do the conversion in PowerShell. If you really need to use a batch file, wrap everything in a *.ps1 file and call it from the batch file.
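The property that matters for the upgrade, uninstall, re-install scenario is that the conversion can run any number of times without damaging files that are already UTF-8. The following Python sketch mirrors the try/catch logic above and checks that property on a throwaway directory. The names `to_utf8`, `convert_all` and `slot1.sav` are hypothetical, and unlike the PowerShell snippet it skips the write entirely when a file is already UTF-8.

```python
import tempfile
from pathlib import Path


def to_utf8(path: Path) -> bool:
    """Rewrite path as UTF-8 without BOM; return True if it was converted."""
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")              # strict decoding is the default
        return False                     # already UTF-8: leave the file alone
    except UnicodeDecodeError:
        text = raw.decode("cp932")       # assumption: anything else is Shift-JIS
        path.write_bytes(text.encode("utf-8"))
        return True


def convert_all(folder: Path, pattern: str = "*.sav") -> int:
    """Convert every matching file in folder; return how many were rewritten."""
    return sum(to_utf8(p) for p in sorted(folder.glob(pattern)))


# Simulate an upgrade followed by a re-install on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "slot1.sav").write_bytes("日本語".encode("cp932"))
    first = convert_all(folder)    # upgrade: the Shift-JIS file is converted
    second = convert_all(folder)   # re-install: nothing is left to convert
    result = (folder / "slot1.sav").read_bytes().decode("utf-8")

print(first, second, result)       # 1 0 日本語
```

The second pass is a no-op, so the text survives a repeated run instead of being decoded as Shift-JIS a second time.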

phuclv
  • The distinction between ASCII and UTF-8 is artificial. ASCII **is** valid UTF-8. – IInspectable Sep 28 '20 at 05:35
  • @IInspectable I know. What I mean is that there may exist some sequence in Shift-JIS that's also valid UTF-8 but I can't confirm since I don't know how Shift-JIS encodes the values – phuclv Sep 28 '20 at 06:17