I am running into a strange encoding problem that I don't understand and have never seen before. For reference, I am using PHP 5.5 on an Ubuntu machine.

The problem

I have a simple file, index.php, in which I want to print this string:

<?php echo "übermotivierter";  ?>

When viewing this in the browser, I would expect the following output:

�bermotivierter

This works as expected!

To display this correctly, I took the following steps:

  1. Changed the encoding of my IDE (Zend Studio) to UTF-8 and saved the file again
  2. Set the appropriate HTML meta tag

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
  3. Set the appropriate PHP header

    <?php header("Content-Type: text/html; charset=utf-8"); ?>
    

After doing this, I expected the string to display correctly, but I am still getting this weird � in the output!

The workaround

To make it display correctly, I had to do this:

<?php echo utf8_encode("übermotivierter");  ?>

Now it displays correctly.

My Question

I really don't understand why I have to use utf8_encode when my document is already encoded and saved in UTF-8. That doesn't make any sense to me. Is there an explanation for this?
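
As background for the question: utf8_encode() treats its input as ISO-8859-1 and re-encodes it as UTF-8. The sketch below illustrates that conversion on explicit bytes. It assumes the iconv extension is available and uses iconv() as a stand-in for the same conversion (utf8_encode() is deprecated as of PHP 8.2); hexBytes() and latin1ToUtf8() are hypothetical helper names, not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): show a string's bytes as hex.
function hexBytes($s) {
    return implode(' ', str_split(bin2hex($s), 2));
}

// Assumption: iconv() stands in for the ISO-8859-1 -> UTF-8 conversion
// that utf8_encode() performs.
function latin1ToUtf8($s) {
    return iconv('ISO-8859-1', 'UTF-8', $s);
}

// Explicit byte escapes, so the result does not depend on how this file is saved.
$latin1 = "\xFCber";      // "über" with "ü" as the single ISO-8859-1 byte fc
$utf8   = "\xC3\xBCber";  // "über" with "ü" as the UTF-8 sequence c3 bc

// ISO-8859-1 input comes out as valid UTF-8.
echo hexBytes(latin1ToUtf8($latin1)), "\n"; // c3 bc 62 65 72

// Input that is already UTF-8 gets encoded a second time.
echo hexBytes(latin1ToUtf8($utf8)), "\n";   // c3 83 c2 bc 62 65 72
```

If the workaround produces the correct output rather than a doubly encoded one, that suggests the bytes reaching PHP were ISO-8859-1 and not UTF-8, whatever the editor setting says.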

Jay

2 Answers

With Git 2.18+ (Q2 2018), you might not need any third-party trick to convert repository content to UTF-8: the new "working-tree-encoding" attribute asks Git to convert the contents to the specified encoding when checking out to the working tree (and back again when checking in).

See commit e92d622, commit 541d059, commit 7a17918, commit 107642f, commit c6e4865, commit 10ecb82, commit 2f0c4a3 (15 Apr 2018), commit 66b8af3 (09 Mar 2018), and commit 13ecb46, commit a8270b0 (15 Feb 2018) by Lars Schneider (larsxschneider).
(Merged by Junio C Hamano -- gitster -- in commit 1ac0ce4, 08 May 2018)

convert: add 'working-tree-encoding' attribute

Git recognizes files encoded with ASCII or one of its supersets (e.g. UTF-8 or ISO-8859-1) as text files.
All other encodings are usually interpreted as binary and consequently built-in Git text processing tools (e.g. 'git diff') as well as most Git web front ends do not visualize the content.

Add an attribute to tell Git what encoding the user has defined for a given file. If the content is added to the index, then Git reencodes the content to a canonical UTF-8 representation. On checkout Git will reverse this operation.

If there is any issue, you now have the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable tracing for content that is reencoded with the 'working-tree-encoding' attribute.
This is useful to debug encoding issues.

The documentation now mentions:

Please note that using the working-tree-encoding attribute may have a number of pitfalls:

  • Alternative Git implementations (e.g. JGit or libgit2) and older Git versions (as of March 2018) do not support the working-tree-encoding attribute.
    If you decide to use the working-tree-encoding attribute in your repository, then it is strongly recommended to ensure that all clients working with the repository support it.

    For example, Microsoft Visual Studio resource files (*.rc) or PowerShell script files (*.ps1) are sometimes encoded in UTF-16.
    If you declare *.ps1 files as UTF-16 and you add foo.ps1 with a working-tree-encoding enabled Git client, then foo.ps1 will be stored as UTF-8 internally.
    A client without working-tree-encoding support will check out foo.ps1 as a UTF-8 encoded file. This will typically cause trouble for the users of this file.

    If a Git client, that does not support the working-tree-encoding attribute, adds a new file bar.ps1, then bar.ps1 will be stored "as-is" internally (in this example probably as UTF-16).
    A client with working-tree-encoding support will interpret the internal contents as UTF-8 and try to convert it to UTF-16 on checkout. That operation will fail and cause an error.

  • Reencoding content requires resources that might slow down certain Git operations (e.g. 'git checkout' or 'git add').

Use the working-tree-encoding attribute only if you cannot store a file in UTF-8 encoding and if you want Git to be able to process the content as text.


As an example, use the following attributes if your '*.ps1' files are UTF-16 encoded with byte order mark (BOM) and you want Git to perform automatic line ending conversion based on your platform.

*.ps1     text working-tree-encoding=UTF-16

Use the following attributes if your '*.ps1' files are UTF-16 little endian encoded without BOM and you want Git to use Windows line endings in the working directory.
Please note, it is highly recommended to explicitly define the line endings with eol if the working-tree-encoding attribute is used to avoid ambiguity.

*.ps1 text working-tree-encoding=UTF-16LE eol=crlf

VonC

Not an answer but too long for a comment:

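Could you please try

<?php
$s = "übermotivierter";
// Print the string itself, then each of its bytes as two-digit hex.
echo '<p>', $s, '</p><p>';
$length = strlen($s); // strlen() counts bytes, not characters
for ($i = 0; $i < $length; $i++) {
    printf('%02x ', ord($s[$i]));
}
echo '</p>';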

in the place where you had <?php echo "übermotivierter"; ?>?
What's the output of that?
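
For reading the result, a minimal sketch of what the two likely outcomes look like: in UTF-8, "ü" is the two bytes c3 bc, while in ISO-8859-1 it is the single byte fc. The strings below are written with explicit byte escapes so the sketch does not depend on how the file itself is saved; hexBytes() is a hypothetical helper name, and the preg_match() call with the /u modifier is just one possible way to test for valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original post): render a string's bytes as hex.
function hexBytes($s) {
    return implode(' ', str_split(bin2hex($s), 2));
}

// The same word in two encodings, spelled out byte by byte.
$latin1 = "\xFCbermotivierter";      // "ü" as one ISO-8859-1 byte
$utf8   = "\xC3\xBCbermotivierter";  // "ü" as a two-byte UTF-8 sequence

echo hexBytes($latin1), "\n"; // begins with: fc 62 65 72
echo hexBytes($utf8), "\n";   // begins with: c3 bc 62 65 72

// preg_match() with the /u modifier only succeeds on valid UTF-8 subjects.
var_dump(preg_match('//u', $utf8) === 1);   // bool(true)
var_dump(preg_match('//u', $latin1) === 1); // bool(false)
```

If the dump from index.php starts with fc, the file on disk is probably still ISO-8859-1 despite the IDE setting; if it starts with c3 bc, the file is UTF-8 and the problem lies elsewhere.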

VolkerK