1

I'd really appreciate some help with this. I've wasted days on this problem and none of the suggestions I have found online seem to give me a fix.

I have a CSV file from a supplier. It appears to have been exported from an Microsoft system.

I'm using PHP to import the data into MySQL (both latest versions).

I have one particular record which contains a strange character that I can't get rid of. Manual editing to remove the character is possible, but I would prefer an automated solution as this will happen multiple times a day.

The character appears to be an interpretation of a “smart quote”. A hex editor tells me that the character codes are C2 and 92. In the hex editor it looks like a weird A followed by a smart quote. In other editors and Calc, Writer etc it just appears as a box. メ

I'm using mb_detect_encoding to determine the encoding. All records in the CSV file are returned as ASCII, except the one with the strange character, which is returned as UTF-8. I can insert the offending record into MySQL and it just appears in Workbench as a square.

MySQL tables are configured to utf-8 – utf8_unicode_ci and other unusual UTF characters (eg fractions) are ok.

I've tried lots of solutions to this...

etc etc but none of them have worked for me.

All I really want to do is remove or replace the offending character, ideally with a search and replace for the hex values but none of the examples I have tried have worked.

Can anyone help me move forward with this one please?

EDIT:
Can't post answer as not enough reputation:
Thanks for your input. Much appreciated.
I'm just going to go with the hex search and replace:

$DodgyText = preg_replace("/\xEF\xBE\x92/", "" ,$DodgyText);

I know it's not the elegant solution, but I need a quick fix and this works for me.

Community
  • 1
  • 1
Simon Roberts
  • 617
  • 1
  • 7
  • 14

3 Answers3

3

Another solution is:

$contents = iconv('UTF-8', 'Windows-1251//IGNORE',$contents);
$contents = iconv('Windows-1251', 'UTF-8//IGNORE',$contents);

Where you can replace Windows-1251 to your local encoding.

Ar2r
  • 108
  • 7
1

At a quick glance, this looks like a UTF-8 file. (UTF-8 is identical with the first 128 characters in the ASCII table, hence everything is detected as ASCII except for the special character.)

It should work if your database connection is also UTF-8 encoded (which it may not be by default).

How to do that depends on your database library, let us know which one you're using if you need help setting the connection encoding.

Pekka
  • 442,112
  • 142
  • 972
  • 1,088
0

updated code based on established findings

You can do search & replace on strings using hexadecimal notation:

str_replace("\xEF\xBE\x92", '', $value);

This would return the value with the special code removed

That said, if your database table is UTF-8, you shouldn't need that conversion; instead you could look at the connection (or session) character set (i.e. SET NAMES utf8;). Configuring this depends on what library you use to connect to your database.

To debug the value you could use bin2hex(); this usually helps in doing searches online.

Ja͢ck
  • 170,779
  • 38
  • 263
  • 309
  • I tried that and MySQL didn't like it (Query failed: Incorrect string value: '\xBCqqq (...' for column). That 'qqq' is mine. – Simon Roberts Jul 02 '12 at 14:04
  • @SimonRoberts I've added one thing that you could improve your question with; the exact string contents in hexadecimal notation. – Ja͢ck Jul 02 '12 at 14:10
  • I think I may have got those Hex values wrong. I just looked at it again and the text string is (substituting dots for weird characters) is “that...s exactly”. The corresponding hex string is”74 68 61 74 EF BE 92 73 20 65 78 61 63 74 6C 79”. – Simon Roberts Jul 02 '12 at 14:29
  • @simonroberts well, you could do str_replace() on that or use utf8_encode() perhaps? – Ja͢ck Jul 02 '12 at 14:32