
I am trying to read a non-UTF-8 encoded file and print its contents, like this:

content, _ := os.ReadFile("example.csv")
fmt.Println(string(content))

Output:

����ҭ��dzԪ�� �Ӻ��Ҵ�˭�

Then I tried to decode the content rune by rune as UTF-8, like:

br := make([]rune, 0)
for len(content) > 0 {
    r, size := utf8.DecodeRune(content)
    br = append(br, r)
    content = content[size:]
}
fmt.Println(string(br))

But the result was the same. How can I get the right content? PS: I cannot know the file's encoding in advance; it can be one of several types, such as traditionalchinese.Big5 or japanese.ShiftJIS, and the input is not necessarily a file. It can be a string.

Ken White
sinkyminky
  • Please [edit] your question to improve your [mcve]. In particular, share a hexadecimal dump of the file (2-3 lines should suffice). – JosefZ Dec 24 '22 at 19:40
  • Hi @JosefZ, thank you! The content has Thai words like "ตำบลในเมือง" and also Latin letters. With the Latin letters there is no problem, but the Thai letters are printed out as I mentioned in the question. – sinkyminky Dec 24 '22 at 20:09
  • If you don't know the encoding, it's basically impossible. There are algorithms out there that will guess, but they guess wrong sometimes. – hobbs Dec 24 '22 at 20:54
  • Here’s a worked example, https://stackoverflow.com/a/73573464/246801. In your case you’ll want to swap out the writers for readers. I was inspired by this answer which is already reading, https://stackoverflow.com/a/55632545/246801. – Zach Young Dec 25 '22 at 03:46
  • 1
    @ZachYoung your solution need to know to encoding type of the text but it can be various types in my case and i assume like hobbs mentioned there is not way to decode the text without know to encoded type of text – sinkyminky Dec 25 '22 at 08:29
  • Did you try to decode using "unicode/utf16"? See the docs [here](https://pkg.go.dev/unicode/utf16). – dev.bmax Dec 25 '22 at 14:05
  • Ah, I didn’t read that far, sorry. You want a character detector. There’s the [uchardet project](https://github.com/freedesktop/uchardet), but I don’t know of anything in Go. Python has a simpler character detector if you wanted inspiration for porting to Go, https://pypi.org/project/chardet. – Zach Young Dec 25 '22 at 19:40
  • Oh, someone already has a Go chardet, https://pkg.go.dev/github.com/gogs/chardet#section-readme. – Zach Young Dec 25 '22 at 21:13
  • Here’s a good thread discussing the ins and outs of heuristic detectors: https://groups.google.com/g/mozilla.dev.platform/c/TCiODi3Fea4 – Zach Young Dec 25 '22 at 21:50

1 Answer


Most probably you need packages from the golang.org/x/text/encoding hierarchy.

In particular, golang.org/x/text/encoding/charmap allows creating an encoding.Decoder that translates a stream of bytes in a legacy non-UTF-8 encoding into a UTF-8-encoded stream, which is native to Go.

kostix
  • Hi @kostix, thank you! Mentioned libs are useful if you know the encoding type of text. In my case it can be various types and I do not know when decoding text what is the encoding type. Do you have any suggestion to encoding type agnostic transform to utf-8 ? – sinkyminky Dec 25 '22 at 08:49
  • 1
    @sinkyminky, there's no such thing as «encoding type agnostic transform to utf-8» (or any other encoding): the encoding in which a piece of data is encoded is ether communicated along with that data (for instance, that's what `Content-Type` in HTTP and e-mail (MIME 1.0) is for or, barring that, can be _tried to be guessed._ This last one can be implemented but is inherently fragile. – kostix Dec 25 '22 at 10:50
  • Thanks for the explanation! I am trying to understand how e.g. the iOS Numbers application can open the file with the right encoding. Somehow Numbers determines the file's encoding, because the content is shown correctly when I open the file with Numbers. I just wonder whether there is any way to detect a file's encoding in Go like Numbers does. – sinkyminky Dec 25 '22 at 11:54
  • 1
    @sinkyminky, I may guess that this app either guesses the encoding using some heuristics or just _assumes_ some non-Unicode encoding based on the device's [locale](https://en.wikipedia.org/wiki/Locale_(computer_software)). This is actually nothing new: say, MS-DOS and Windows used to have non-Unicode 8-bit encodings, and the current encoding in use was based on the "system's language". Windows even had two such "code pages" active at any time: one for native programs, and one for legacy MS-DOS programs. – kostix Dec 25 '22 at 13:58
  • 1
    @sinkyminky, also see [this](https://stackoverflow.com/a/17988893/720999) and the link to the Apple docs it contains. – kostix Dec 25 '22 at 14:01
  • I appreciate these informative explanations, @kostix. – sinkyminky Dec 25 '22 at 18:24