
"We’ve detected the file encoding as ISO-8859-1. When you commit changes we will transcode it to UTF-8"

This is what GitHub displays when I try to upload Windows-1252 .txt files.

Result: All characters unknown to UTF-8 are displayed as �

My Atom editor uses Windows-1252 as its default and displays the files correctly, but the staged changes window shows the �'s.

How can I stop GitHub from doing this?

T. Tom
  • Is there a specific reason for the files to be win-1252? Because if not, you should probably use the much more universal utf-8 encoding on your text files. The problem with Windows 8-bit encodings is that the special characters can look different when you open them on a computer set to a different language, and there's _no way_ to detect what the file's original intended encoding was. – Nyerguds Oct 07 '18 at 15:51
  • yes because my files have many characters like Š/š and Ž/ž which are not supported by UTF-8 – T. Tom Oct 07 '18 at 15:53
  • Actually, utf-8 supports _all_ unicode characters. You just need to convert it correctly. – Nyerguds Oct 07 '18 at 15:53
  • no it does not and that's why I'm getting �'s – T. Tom Oct 07 '18 at 15:55
  • also it really has to be win 1252 because the application I'm committing to is using this charset – T. Tom Oct 07 '18 at 15:55
  • That is because the files are _interpreted as_ utf-8. You need to _actively convert them_ instead. Open a file in Notepad, select Save As, and specifically select "utf-8" from the encoding menu. – Nyerguds Oct 07 '18 at 15:56
  • Are you sure your files are actually being modified though? If they show � in utf-8 that may mean that, while Git tries to _show_ them as utf-8, and that fails, the actual files are still just win-1252. I advise you to read up on encodings in general. Only UTF encodings actually save into the file that the file is saved with a specific encoding (using a 'byte order mark' at the start of the file). Windows encodings do no such thing, which is part of why they are so problematic to identify. – Nyerguds Oct 07 '18 at 15:57
  • jesus christ you're right, I downloaded it again just to see it in my editor and it still is 1252. well ty lol – T. Tom Oct 07 '18 at 16:02
  • There seem to be defaults in the Git settings, mind you... I advise you to check out [this question](https://stackoverflow.com/questions/48907049/what-is-the-appropriate-character-encoding-for-a-git-repo). – Nyerguds Oct 07 '18 at 16:03
  • where do I find this setting? is this accessible through the browser? – T. Tom Oct 07 '18 at 16:08
  • I have no idea. Check the question I linked, and do some research on your own... – Nyerguds Oct 07 '18 at 16:08
  • The problem seems to be more one of the file being detected as ISO-8859-1 when converting to UTF-8; Windows-1252 is a superset of ISO-8859-1, and Š/š and Ž/ž are **not** part of ISO-8859-1, but are part of windows-1252. So the fact these are mapped to � is 'expected' if it assumes ISO-8859-1 when converting to UTF-8. – Mark Rotteveel Oct 07 '18 at 16:11
  • @Nyerguds: You know having a file starting with UTF-8 BOM, while indicative of having an UTF-8 BOM file, is not conclusive evidence? Nor is absence of UTF-8 BOM any more conclusive. – Deduplicator Oct 07 '18 at 16:37
  • @Deduplicator I'm aware. But utf-8 has strict bitwise rules it must follow to be valid, so it's still easy to detect. Meanwhile, there's no way to distinguish the >0x80 content of 8-bit extended ascii encodings except language-based heuristics. – Nyerguds Oct 07 '18 at 16:41
  • @Nyerguds "*Only UTF encodings actually save into the file that the file is saved with a specific encoding (using a 'byte order mark' at the start of the file)*" - true, though in the case of UTF-8, a BOM is generally *discouraged* by the Unicode standard, as 1) it is not backwards compatible with legacy apps that would otherwise be able to consume UTF-8 files when treating them as ASCII files (a key motivation for UTF-8), and 2) its presence can cause some ambiguities when converting data between UTFs as to whether U+FEFF was intended to be a BOM or not. – Remy Lebeau Oct 09 '18 at 17:17
  • @RemyLebeau I still prefer the BOM to be there even if the file is pure ASCII, for the simple reason that if it's present, the file is much more likely to be re-saved as utf8 when edits that _may add special characters_ happen to it later. Without the BOM, it's once again completely ambiguous and might be re-saved as win-1252. Especially for programmatically treated files, it's nice to know what encoding to expect on them. – Nyerguds Oct 10 '18 at 20:57
  • @Nyerguds oh, I'm not debating the usefulness of a BOM in UTF-8, I'm just pointing out that some software can't handle it correctly if it is present, that's all. – Remy Lebeau Oct 10 '18 at 21:12
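
To illustrate the point made in the comments about Š not being part of ISO-8859-1, here is a minimal shell sketch. It assumes an iconv that accepts the names windows-1252 and iso-8859-1 (GNU iconv does); the helper name recode_0x8a is made up for this example. Byte 0x8A is Š in Windows-1252, so the source encoding given to iconv decides what ends up in the UTF-8 output.

```shell
#!/bin/sh
# Hypothetical helper: print, as hex, the UTF-8 bytes iconv produces when
# the single byte 0x8A is decoded using the source encoding given as $1.
recode_0x8a() {
        printf '\212' | iconv -f "$1" -t utf-8 | od -An -tx1 | tr -d ' \n'
}

recode_0x8a windows-1252   # c5a0 = U+0160, the letter Š
echo
recode_0x8a iso-8859-1     # c28a = U+008A, a C1 control character, not Š
echo
```

The second result is why a converter that assumes ISO-8859-1 cannot produce the intended letter from a Windows-1252 file.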

1 Answer

In addition to setting an encoding directive in .gitattributes, you can convert your existing files to UTF-8, for example with a script like this:

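#!/bin/sh
# Re-encode every file outside .git from Windows-1252 to UTF-8, in place.
# Use -f iso-8859-1 instead if the files really are ISO-8859-1.
find . -type f ! -path './.git/*' -print | while IFS= read -r f; do
        mv -i "$f" "$f.recode.$$"
        # Delete the backup only if the conversion succeeded.
        iconv -f windows-1252 -t utf-8 < "$f.recode.$$" > "$f" &&
                rm -f "$f.recode.$$"
done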

You can tweak the find expression to limit the conversion to a subset of your files.
Only by pushing UTF-8 files can you be sure to see the right characters on your GitHub repo page.
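
Before pushing, it may help to check which files still fail to decode as UTF-8. The following is only a sketch under a few assumptions: the function names is_utf8 and list_non_utf8 are invented here, and the check relies on iconv exiting with a non-zero status when its input is not valid UTF-8. Pure ASCII files pass the check, and binary files will usually be reported, so the output is a hint rather than a verdict.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if file $1 decodes cleanly as UTF-8.
is_utf8() {
        iconv -f utf-8 -t utf-8 < "$1" > /dev/null 2>&1
}

# Hypothetical helper: print every file under directory $1 (default: the
# current directory), skipping .git, that is not valid UTF-8.
list_non_utf8() {
        find "${1:-.}" -type f ! -path '*/.git/*' -print |
        while IFS= read -r f; do
                is_utf8 "$f" || printf 'not UTF-8: %s\n' "$f"
        done
}
```

Calling list_non_utf8 with no argument from the top of the working tree lists the files that still need converting; no output means every file checked decoded as UTF-8.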

VonC