
I want to compare two strings, but one is ASCII-encoded and the other is UTF-8-encoded.

I tried converting the ASCII string to UTF-8 with utf8_encode, iconv, and mb_convert_encoding, but none of them work: the string is still reported as ASCII.

What is strange is that the ASCII string is defined in my PHP code file (it does not come from a file or a database). When a string contains an accented character, it is detected as UTF-8.

Here is an example from my PHP file:

$return = array(
  array('name' => 'Date'), // => ASCII
  array('name' => 'État') // => utf-8
);

And here is my string comparison:

// $line is read from a file opened with fopen(). It should be equal to 'Date'
$stringUtf8 = iconv(mb_detect_encoding($line), 'UTF-8//IGNORE', $line); // => UTF-8
$foundHeader = "Date"; // => ASCII

if($foundHeader == $stringUtf8){
  echo "foo";
}

It never echoes "foo" :(

And here is what I dump:

dd([
  $foundHeader, 
  strlen($foundHeader),
  mb_detect_encoding($foundHeader),
  $stringUtf8,
  strlen($stringUtf8),
  mb_detect_encoding($stringUtf8),
]);

And the result:

array:6 [
  0 => "Date"
  1 => 4
  2 => "ASCII"
  3 => "Date"
  4 => 7
  5 => "UTF-8"
]

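And here is how I read the file contents:

$handle = fopen($fichier->getPathname(), 'r');
// Compare against false so that a line such as "0" does not end the loop early
while (($line = fgets($handle)) !== false) {
  // ... do stuff
}
fclose($handle);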

Nico
  • ASCII is a subset of UTF-8. All ASCII characters have the same encoding in UTF-8, so there's nothing to convert. – Barmar Mar 25 '22 at 17:15
  • What's the actual problem you're having? You show an array definition, but not any comparison code. – Barmar Mar 25 '22 at 17:20
  • `var_dump(iconv(mb_detect_encoding($line), 'UTF-8//IGNORE', $line), mb_detect_encoding($line), $line)` gives what? Is `État` somewhere? – user3783243 Mar 25 '22 at 18:09
  • It gives `Date` as UTF-8 encoded. No, `État` is not used in the comparison. It was just to explain that when I define a string with an accent, it is detected as UTF-8 by default, but when I define a string without an accent (such as `Date`), it is detected as "ASCII". – Nico Mar 25 '22 at 18:19
  • https://www.php.net/manual/en/function.str-split.php then https://www.php.net/manual/en/function.ord.php on each char gives what values? – user3783243 Mar 25 '22 at 18:26
  • What does `echo strlen($stringUtf8)` show? – Barmar Mar 25 '22 at 18:35
  • There is literally no difference between the string `Date` whether it is encoded as ASCII or UTF8. As Barmar said, UTF8 is a superset of ASCII. Anything _beyond_ the basic English alphabet, basic punctuation, and numbers _cannot be ASCII_ and is going to be encoded as _something else_ which is usually a superset of ASCII and _might_ be UTF8. You need to describe the problem that you're _actually having_, rather than this strange way you're trying to solve it, as that will likely be easier for us to solve. – Sammitch Mar 25 '22 at 18:36
  • @Sammitch I updated my question with the dump of the variables – Nico Mar 25 '22 at 21:46
  • Add in a bin2hex of the strings, and I'll tell you which non-printing characters you have in there. – Sammitch Mar 25 '22 at 21:57
  • I got `bin2hex($foundHeader) = 44617465` and `bin2hex($stringUtf8) = efbbbf44617465` – Nico Mar 25 '22 at 22:26
  • @Nico Your $stringUtf8 has the BOM in it. https://stackoverflow.com/questions/32184933/remove-bom-%C3%AF-from-imported-csv-file You should use hex decoding tools in the future to debug (e.g. `44617465` is the same in both and maps to `Date` so just needed to look up what `efbbbf` is). – user3783243 Mar 28 '22 at 01:57
  • Thanks @user3783243. Now I get `bin2hex($foundHeader) = 4e6f6d206465206c276f7074696f6e2031` and `bin2hex($stringUtf8) = 4e6f6d206465206c276f7074696f6ec2a031` as I loop through the lines while reading the file. Is there an issue with how I read the file? – Nico Mar 28 '22 at 10:01
  • @Nico Can you add how you are going over the file? – user3783243 Mar 28 '22 at 12:43
  • @user3783243 I added it in the post at the end – Nico Mar 29 '22 at 21:36
  • I found it, it is c2a0 for non-breaking space instead of 20 for space – Nico Mar 29 '22 at 21:46
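
The two findings in the comments above (a leading UTF-8 BOM, `efbbbf`, and a non-breaking space, `c2a0`, where a plain space, `20`, was expected) can both be handled before the comparison. The following is only a minimal sketch, assuming the file is UTF-8 encoded; `normalizeHeader` is a hypothetical helper name that does not appear in the thread, and the sample strings are rebuilt from the hex dumps posted above.

```php
<?php
// Hypothetical helper (not from the thread): cleans up a line read with fgets()
// so that it can be compared with a string literal from the PHP source.
function normalizeHeader(string $line): string
{
    // Remove the UTF-8 byte order mark (EF BB BF) if the line starts with it.
    if (strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
        $line = substr($line, 3);
    }

    // Replace each UTF-8 non-breaking space (C2 A0) with a plain space (20).
    $line = str_replace("\xC2\xA0", ' ', $line);

    // fgets() keeps the trailing newline; trim() drops it along with
    // other leading and trailing ASCII whitespace.
    return trim($line);
}

// First line of the file: BOM + "Date" + newline.
$fromFile = "\xEF\xBB\xBFDate\n";
var_dump(bin2hex($fromFile));                     // string(16) "efbbbf446174650a"
var_dump(normalizeHeader($fromFile) === 'Date');  // bool(true)

// A later header containing a non-breaking space before the "1".
$withNbsp = "Nom de l'option\xC2\xA01";
var_dump(normalizeHeader($withNbsp) === "Nom de l'option 1"); // bool(true)
```

Dumping `bin2hex()` of both strings, as suggested in the comments, remains the quickest way to spot any other invisible bytes that this sketch does not cover.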

0 Answers