25

I have a form with a textarea. Users enter a block of text which is stored in a database.

Occasionally a user will paste text from Word containing smart quotes or emdashes. Those characters appear in the database as: –, ’, “ ,â€

What function should I call on the input string to convert smart quotes to regular quotes and emdashes to regular dashes?

I am working in PHP.

Update: Thanks for all of the great responses so far. The page on Joel's site about encodings is very informative: http://www.joelonsoftware.com/articles/Unicode.html

Some notes on my environment:

The MySQL database is using UTF-8 encoding. Likewise, the HTML pages that display the content are using UTF-8 (Update:) by explicitly setting the meta content-type.

On those pages the smart quotes and emdashes appear as a diamond with question mark.

Solution:

Thanks again for the responses. The solution was twofold:

  1. Make sure the database and HTML files were explicitly set to use UTF-8 encoding.
  2. Use htmlspecialchars() instead of htmlentities().
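
A minimal sketch of step 2, assuming the stored text is valid UTF-8. The helper name `escape_html` is hypothetical, not something from the original code. `htmlspecialchars()` only escapes the HTML-special ASCII characters, so curly quotes and dashes pass through unchanged as long as the page is served as UTF-8:

```php
<?php
// Hypothetical helper: escape text for HTML output without touching
// multibyte characters such as curly quotes or dashes.
function escape_html(string $text): string
{
    // ENT_QUOTES escapes both ' and "; the third argument declares the
    // encoding of $text, assumed here to be UTF-8.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Curly quotes are left alone; only &, < and > are escaped here.
echo escape_html("\u{201C}Tom & Jerry\u{201D} <b>"), "\n";
```
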
A J
GloryFish

13 Answers

15

This sounds like a Unicode issue. Joel Spolsky has a good jumping off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html

theraccoonbear
9

The mysql database is using UTF-8 encoding. Likewise, the html pages that display the content are using UTF-8.

The content of the HTML can be in UTF-8, yes, but are you explicitly setting the content type (encoding) of your HTML pages (generated via PHP?) to UTF-8 as well? Try returning a Content-Type header of "text/html;charset=utf-8" or add a <meta> tag to your HTML pages:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>

That way, the content type of the data submitted to PHP will also be the same.

I had a similar issue and adding the <meta> tag worked for me.

Ates Goral
4

It sounds like the real problem is that your database is not using the same character encoding as your page (which should probably be UTF-8). In that case, if any user submits a non-ASCII character you'll probably see weird characters in the database. Finding and fixing just a few of them (curly quotes and em dashes) isn't going to solve the real problem.

Here is some info on migrating your database to another character encoding, at least for a MySQL database.

Kip
2

This is an unfortunately all-too-common problem, not helped by PHP's very poor handling of character sets.

What we do is force the text through iconv:

// Convert input data to UTF8, ignore any odd (MS Word..) chars
// that don't translate
$input = iconv("ISO-8859-1","UTF-8//IGNORE",$input);

The //IGNORE flag means that anything that can't be translated will be thrown away.

From the PHP manual: "If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded."
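
Converting unconditionally will garble input that is already UTF-8, so here is a hedged sketch that converts only when the input is not valid UTF-8. The helper name `to_utf8` is hypothetical, and the fallback to Windows-1252 (rather than ISO-8859-1) is an assumption about where the text came from. It needs the mbstring extension:

```php
<?php
// Hypothetical helper: return $input as UTF-8, converting only when needed.
function to_utf8(string $input): string
{
    // Valid UTF-8 (including plain ASCII) is returned untouched, so text
    // that is already correct is never converted a second time.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // Assumption: anything else is Windows-1252, the usual source of
    // Word's smart quotes and dashes.
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}
```

This is a heuristic rather than real detection: a few Windows-1252 byte sequences also happen to be valid UTF-8 and would be left as-is.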

ConroyP
  • 1
    This seems like such a perfect "quick fix" but sadly it wound up making my test case significantly worse by adding *more* invalid characters. – niczak Feb 27 '09 at 21:08
  • 5
    Converting from Latin 1 to UTF-8 only makes sense if you *know* that the input character set is Latin 1. But if the input is already UTF-8, you will only garble it further by "translating" it from Latin 1 to UTF-8 a second time. – Mark E. Haase Feb 08 '11 at 21:17
1

We would often use standard string replace functions for that. Even though the nature of ASCII/Unicode in that context is pretty murky, it works. Just make sure your PHP file is saved in the right encoding, and so on.
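
A minimal sketch of that replacement, assuming the input string is UTF-8. The function name `ascii_punctuation` and the choice of characters in the map are illustrative, not a complete list. Using `\u{...}` escapes (PHP 7+) avoids depending on how the source file itself is saved:

```php
<?php
// Hypothetical helper: replace common "smart" punctuation with ASCII.
// Assumes $text is UTF-8.
function ascii_punctuation(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // horizontal ellipsis
    ];
    // strtr() with an array replaces every key in a single pass.
    return strtr($text, $map);
}
```
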

mspmsp
1

In my experience, it's easier to just accept the smart quotes and make sure you're using the same encoding everywhere. To start, add this to your form tag: accept-charset="utf-8"

Patrick McElhaney
1

You could try mb_convert_encoding from ISO-8859-1 to UTF-8.

$str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');

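This assumes you want UTF-8 and that the conversion can find reasonable replacements. If not, replace them yourself with str_replace or preg_replace (PHP has no built-in mb_str_replace; str_replace works on UTF-8 strings as long as the search strings are UTF-8 too).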

Greg
1

This may not be the best solution, but I'd try testing to find out what PHP sees. Let's say it sees "–" (there are a few other possibilities, like simple "“" or maybe "&#8220;"). Then do a str_replace to get rid of all of those and replace them with normal quotes, before stuffing the answer in a database.
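
One way to find out what PHP actually sees is to dump the raw bytes of the submitted string. This is only a debugging sketch, and the helper name `hex_bytes` is hypothetical:

```php
<?php
// Hypothetical debugging helper: show the raw bytes of a string as hex.
function hex_bytes(string $s): string
{
    // bin2hex() gives two hex digits per byte; split them into pairs.
    return implode(' ', str_split(bin2hex($s), 2));
}

// A correctly encoded UTF-8 left double quote is three bytes.
echo hex_bytes("\u{201C}"), "\n"; // e2 80 9c
```

If a left double quote shows up as `e2 80 9c`, PHP received UTF-8. A single `93` byte means the input arrived as Windows-1252.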

The better solution would probably involve making the end-to-end data passing all UTF-8, as people are trying to help with in other answers.

Domenic
1

You have to be sure your database connection is configured to accept and provide UTF-8 from and to the client (otherwise it will convert to the "default", which is usually latin1).

In practice this means running a query SET NAMES 'utf8';

http://www.phpwact.org/php/i18n/utf-8/mysql
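
The same setting can also be made when the connection is opened. As a hedged sketch, PDO's MySQL driver accepts a `charset` option in the DSN. The helper name `mysql_dsn`, the host and database names, and the `utf8mb4` default are assumptions for illustration (the query above uses `utf8`):

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that declares the
// connection character set.
function mysql_dsn(string $host, string $db, string $charset = 'utf8mb4'): string
{
    return "mysql:host={$host};dbname={$db};charset={$charset}";
}

// Usage (host, database name and credentials are placeholders):
// $pdo = new PDO(mysql_dsn('localhost', 'app'), 'user', 'secret');
// With mysqli the equivalent call is: $mysqli->set_charset('utf8mb4');
```
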

Also, smart quotes are part of the windows-1252 character set, not iso-8859-1 (latin-1). Not very relevant to your problem, but just FYI. The euro symbol is in there as well.

Joeri Sebrechts
1

If you were looking to escape these characters for the web while preserving their appearance, so your strings will appear like this: “It’s nice!” rather than "It's boring"...

You can do this by using your own custom htmlEncode function in place of PHP's htmlentities():

$trans_tbl = false;

function htmlEncode($text) {

  global $trans_tbl;

  // create translation table once
  if(!$trans_tbl) {
    // start with the default set of conversions and add more.

    $trans_tbl = get_html_translation_table(HTML_ENTITIES); 

    $trans_tbl[chr(130)] = '&sbquo;';    // Single Low-9 Quotation Mark
    $trans_tbl[chr(131)] = '&fnof;';    // Latin Small Letter F With Hook
    $trans_tbl[chr(132)] = '&bdquo;';    // Double Low-9 Quotation Mark
    $trans_tbl[chr(133)] = '&hellip;';    // Horizontal Ellipsis
    $trans_tbl[chr(134)] = '&dagger;';    // Dagger
    $trans_tbl[chr(135)] = '&Dagger;';    // Double Dagger
    $trans_tbl[chr(136)] = '&circ;';    // Modifier Letter Circumflex Accent
    $trans_tbl[chr(137)] = '&permil;';    // Per Mille Sign
    $trans_tbl[chr(138)] = '&Scaron;';    // Latin Capital Letter S With Caron
    $trans_tbl[chr(139)] = '&lsaquo;';    // Single Left-Pointing Angle Quotation Mark
    $trans_tbl[chr(140)] = '&OElig;';    // Latin Capital Ligature OE

    // smart single/ double quotes (from MS)
    $trans_tbl[chr(145)] = '&lsquo;'; 
    $trans_tbl[chr(146)] = '&rsquo;'; 
    $trans_tbl[chr(147)] = '&ldquo;'; 
    $trans_tbl[chr(148)] = '&rdquo;'; 

    $trans_tbl[chr(149)] = '&bull;';    // Bullet
    $trans_tbl[chr(150)] = '&ndash;';    // En Dash
    $trans_tbl[chr(151)] = '&mdash;';    // Em Dash
    $trans_tbl[chr(152)] = '&tilde;';    // Small Tilde
    $trans_tbl[chr(153)] = '&trade;';    // Trade Mark Sign
    $trans_tbl[chr(154)] = '&scaron;';    // Latin Small Letter S With Caron
    $trans_tbl[chr(155)] = '&rsaquo;';    // Single Right-Pointing Angle Quotation Mark
    $trans_tbl[chr(156)] = '&oelig;';    // Latin Small Ligature OE
    $trans_tbl[chr(159)] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis

    ksort($trans_tbl);
  }

  // escape HTML      
  return strtr($text, $trans_tbl); 
}
Jonathan Lidbeck
1

Actually, the problem is not happening in PHP but in JavaScript, because of the copy/paste from Word. You need to solve it in JavaScript before you pass the text to PHP. Please see this answer: https://stackoverflow.com/a/6219023/1857295.

Community
Billel Hacaine
  • please add the relevant part of the answer. – Robert Feb 11 '16 at 09:12
  • @Robert he said "I have a form with a textarea. Users enter a block of text which is stored in a database.", so I believe that means he uses JavaScript to pass data from the front end (i.e. the browser) to the server side (i.e. PHP). He also said "paste text from Word" and "What function should I call on the input string", which means before the data enters MySQL. Using that solution will therefore avoid having those strange characters in the database in the first place. – Billel Hacaine Feb 11 '16 at 10:04
1

The problem is the MySQL connection charset. I fixed my issue with this line of code:

mysql_set_charset('utf8',$link); 
hawshy
  • This worked for me as well, added directly above the query that runs the `INSERT`/`UPDATE`. Everything else was set to UTF8 properly, the table charset, the column collations, the HTML output page. Glad this finally did the trick! – purefusion Feb 21 '14 at 21:41
1

You have to manually change the character set and collation of individual columns to UTF-8; changing the database default won't alter existing columns.

Peter O.
Dazbert