My question is different from this one. I am trying to fix a broken encoding, but I don't know how to proceed.

In my database I have this name:

mysql> select filename from file WHERE filename LIKE 'MAC%';
+-------------------------------------------------+
| filename                                        |
+-------------------------------------------------+
| MAC-1600PVå–æ‰±èª¬æ˜Žæ›¸.pdf                   |
+-------------------------------------------------+
1 row in set (0.00 sec)

But on my filesystem the file is named:

$ ls files/*MAC*
files/MAC-1600PV取扱説明書.pdf

I have tried to unpack both strings in PHP, and their contents differ:

The UTF-8 sequence read from the filesystem:

=> "MAC-1600PV取扱説明書"
>>> unpack('C*', $u)
...
7 => 48,
8 => 48,
9 => 80,
10 => 86,
11 => 229,
12 => 143,
13 => 150,
14 => 230,
15 => 137,
16 => 177,

And for the one read from the database:

...
7 => 48,
8 => 48,
9 => 80,
10 => 86,
11 => 195,
12 => 165,
13 => 226,
14 => 128,
15 => 147,
16 => 195,
17 => 166,
18 => 226,
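
For what it's worth, the two byte sequences look related: 195, 165, 226, 128, 147 is what you would get if the clean UTF-8 bytes 229, 143, 150 were read as Windows-1252 characters and then encoded to UTF-8 a second time. Below is a minimal sketch of that check, assuming the mbstring extension is loaded and that Windows-1252 (MySQL's `latin1`) was the intermediate charset. That charset is my inference from the bytes, not something I have confirmed, and the variable names are only illustrative.

```php
<?php
// Complete characters copied from the byte listings (offsets 11 onward).
$double = pack('C*', 195, 165, 226, 128, 147, 195, 166); // double-encoded form
$clean  = pack('C*', 229, 143, 150, 230, 137, 177);      // plain UTF-8 form

// Undo one layer: every character of the double-encoded string
// is mapped back to a single Windows-1252 byte.
$undone = mb_convert_encoding($double, 'Windows-1252', 'UTF-8');

echo bin2hex($undone), "\n"; // e596e6
echo bin2hex($clean), "\n";  // e58f96e689b1
```

The recovered bytes e5, 96, e6 line up with the clean sequence except for 8f, which has no character in Windows-1252 and seems to have been dropped along the way. If that is what happened, simply reversing the conversion would not restore every name.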

So at some point the original encoding was lost, and I have no idea how to fix my database, which uses utf8mb4.

Any advice?

nowox
  • Possible duplicate of [UTF-8 all the way through](https://stackoverflow.com/questions/279170/utf-8-all-the-way-through) – Dharman Jan 16 '19 at 20:32
  • How was the file name written? How is your DB connection set up? Maybe you can scan your directory and rewrite all the DB references? – user3783243 Jan 16 '19 at 20:32
  • Are your columns set to `utf8mb4`? Is your DB connection set to `utf8mb4`? – Dharman Jan 16 '19 at 20:34
  • I suspect that the initial "insert" was wrong; I don't think my database is faulty – nowox Jan 16 '19 at 20:34
  • @Dharman, yes, the collation for both the table and the columns is `utf8mb4`, but I think the bug was introduced before inserting into the database, perhaps 10 years ago. I have a few files like this and I don't know how to fix the broken encoding. – nowox Jan 16 '19 at 20:37
  • `utf8mb4` wasn't around in 2009 so likely it was another charset then. – user3783243 Jan 16 '19 at 20:40
  • You may want to look at https://stackoverflow.com/questions/436220/determine-the-encoding-of-text-in-python (Python). PHP has libmagic, and someone has probably ported chardet to PHP; otherwise you can read its code and take inspiration from it for your own detection. – Giacomo Catenazzi Jan 21 '19 at 11:14
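
The suggestion above to scan the directory and rewrite the DB references could be sketched roughly as follows. This is only a sketch under assumptions: both helper functions are hypothetical names, the intermediate charset is assumed to be Windows-1252, bytes that have no Windows-1252 character are assumed to have been dropped, and mbstring must be available. It applies the suspected damage to each clean name from disk and looks for a stored name that matches.

```php
<?php
// Hypothetical helper: reproduce the suspected damage on a clean UTF-8 name.
function simulateDoubleEncoding(string $utf8): string
{
    // Assumption: bytes undefined in Windows-1252 were silently dropped.
    $stripped = str_replace(["\x81", "\x8D", "\x8F", "\x90", "\x9D"], '', $utf8);
    // Treat each remaining byte as a Windows-1252 character and encode it as UTF-8.
    return mb_convert_encoding($stripped, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: map each broken stored name to the real name on disk.
function matchBrokenNames(array $dbNames, array $diskNames): array
{
    $lookup = [];
    foreach ($diskNames as $name) {
        $lookup[simulateDoubleEncoding($name)] = $name;
    }
    $fixes = [];
    foreach ($dbNames as $stored) {
        // Skip names that are missing on disk or already identical.
        if (isset($lookup[$stored]) && $lookup[$stored] !== $stored) {
            $fixes[$stored] = $lookup[$stored];
        }
    }
    return $fixes;
}

$disk = ['MAC-1600PV取扱説明書.pdf'];      // e.g. the result of scandir('files')
$db   = [simulateDoubleEncoding($disk[0])]; // stand-in for the value stored in the table
print_r(matchBrokenNames($db, $disk));
```

Matching against the filesystem avoids having to restore dropped bytes. The resulting map of stored name to real name could then drive the `UPDATE` statements, after reviewing it by hand and backing up the table.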

0 Answers