Convert all types of smart quotes with PHP

Question

I am trying to convert all types of smart quotes to regular quotes when working with text. However, the following function I've compiled still seems to be lacking support and proper design.

How can I properly get all quote characters converted?

function convert_smart_quotes($string)
{
    $quotes = array(
        "\xC2\xAB"   => '"', // « (U+00AB) in UTF-8
        "\xC2\xBB"   => '"', // » (U+00BB) in UTF-8
        "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
        "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
        "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
        "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
        "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
        "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
        "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
        "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
        "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
        "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
    );
    $string = strtr($string, $quotes);

    // Version 2
    $search = array(
        chr(145),
        chr(146),
        chr(147),
        chr(148),
        chr(151)
    );
    $replace = array("'","'",'"','"',' - ');
    $string = str_replace($search, $replace, $string);

    // Version 3
    $string = str_replace(
        array('&#8216;','&#8217;','&#8220;','&#8221;'),
        array("'", "'", '"', '"'),
        $string
    );

    // Version 4
    $search = array(
        '&lsquo;', 
        '&rsquo;', 
        '&ldquo;', 
        '&rdquo;', 
        '&mdash;',
        '&ndash;',
    );
    $replace = array("'","'",'"','"',' - ', '-');
    $string = str_replace($search, $replace, $string);

    return $string;
}

Note: This question is a complete query about the full gamut of quotes, including the "Microsoft" quotes asked here. This is a "duplicate" in the same way that asking about all tire sizes is a "duplicate" of asking for a car tire size.

What is your purpose in replacing smart quotes? It would normally be best to preserve them; if you have problems with handling the characters then it's likely you have problems with all other non-ASCII characters too, which aren't going to go away by hiding the smart quotes. This code, with its attempt to handle text as both UTF-8 and ISO-8859-1, and both raw text and HTML at the same time, is a messy business that will typically badly mangle many other Unicode characters than just the quotes. — bobince, Nov 17 '13 at 14:45
@bobince, I'm doing string parsing and the quote characters are important to me. I do handle the rest of the unicode glyphs as-is. — Xeoncross, Nov 18 '13 at 16:53
@bobince I would be happy to award an answer that handles other characters as well - but my concern is identifying all quote-glyphs so I can parse strings without having dozens of other forms to worry about. — Xeoncross, Jan 30 '14 at 16:16
What kind of parsing are you trying to do, that requires different types of quote to be converted to one? Converting eg `‘don't’` to use all apostrophes would seem to make it harder to parse if anything. — bobince, Jan 30 '14 at 17:13
In terms of ‘forms’, you simply cannot replace all possible encoded versions of a character in one function without irretrievably mangling other characters. I would suggest getting all your strings in UTF-8 encoding internally and then using only the ‘Version 1’ replacements above. If you need to handle text in HTML markup you should be HTML-decoding it to get plain text so you can then do the same replacement. It is no good trying to replace encoded HTML because there are potentially many forms and encodings. — bobince, Jan 30 '14 at 17:14
(For example versions 3 and 4 are missing things like `&#x2018;`, `&#x201A;`, `&sbquo;` and so on which are valid HTML.) — bobince, Jan 30 '14 at 17:16
@bobince I'm fine with `$string = html_entity_decode(iconv('utf-8', 'utf-8', $string));` before the quote parsing if that is needed. — Xeoncross, Jan 30 '14 at 17:31
Yes that would work fine if your input was definitely HTML-format text content. There is one niggling difference: in non-XML-based HTML, character references in the range `&#128;` to `&#159;` (`&#x80;` to `&#x9F;`) get decoded by web browsers to the characters with the same-numbered Windows code page 1252 code unit, instead of the characters U+0080 to U+009F as you would expect. PHP doesn't reproduce this historical quirk and will leave ampersand sequences in the string for these malformed references. — bobince, Jan 30 '14 at 19:10
I didn't know that, thanks for sharing. I still would like to properly decode escaped characters to straight-up unicode points though. I just want some peace & qu[oi]t͏e҉s — Xeoncross, Jan 30 '14 at 22:51
@bobince, what you say is not true for my PHP 5.3.10, nor do I see any reason for not decoding numeric HTML entities, when the target encoding has the corresponding characters. What is true, though, is that the `"UTF-8"` parameter to `html_entity_decode()` is needed for PHP < 5.4.0, since the default changed from `"ISO-8859-1"` to `"UTF-8"` in 5.4.0. — Walter Tross, Feb 01 '14 at 17:44

Walter Tross · Accepted Answer · 2014-02-04T08:41:53.717

You need something like this (assuming UTF-8 input, and ignoring CJK (Chinese, Japanese, Korean)):

$chr_map = array(
   // Windows codepage 1252
   "\xC2\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
   "\xC2\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
   "\xC2\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
   "\xC2\x91" => "'", // U+0091⇒U+2018 left single quotation mark
   "\xC2\x92" => "'", // U+0092⇒U+2019 right single quotation mark
   "\xC2\x93" => '"', // U+0093⇒U+201C left double quotation mark
   "\xC2\x94" => '"', // U+0094⇒U+201D right double quotation mark
   "\xC2\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark

   // Regular Unicode     // U+0022 quotation mark (")
                          // U+0027 apostrophe     (')
   "\xC2\xAB"     => '"', // U+00AB left-pointing double angle quotation mark
   "\xC2\xBB"     => '"', // U+00BB right-pointing double angle quotation mark
   "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
   "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
   "\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
   "\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
   "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
   "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
   "\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
   "\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
   "\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
   "\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
);
$chr = array_keys  ($chr_map); // but: for efficiency you should
$rpl = array_values($chr_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, "UTF-8"));
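
A minimal sketch of how the snippet above might be wrapped for reuse, assuming UTF-8 input. The function name `normalize_quotes` is hypothetical, and only a few map entries are repeated to keep the example short; in practice the full `$chr_map` would go in its place.

```php
<?php
// Hypothetical wrapper, not part of the original answer.
function normalize_quotes(string $str): string
{
    // The two arrays are built on the first call and reused afterwards.
    static $chr = null, $rpl = null;
    if ($chr === null) {
        $chr_map = array(
            "\xC2\xAB"     => '"', // U+00AB
            "\xC2\xBB"     => '"', // U+00BB
            "\xE2\x80\x98" => "'", // U+2018
            "\xE2\x80\x99" => "'", // U+2019
            "\xE2\x80\x9C" => '"', // U+201C
            "\xE2\x80\x9D" => '"', // U+201D
        );
        $chr = array_keys($chr_map);
        $rpl = array_values($chr_map);
    }
    // Decode HTML character references first, so that &ldquo; and &#8220;
    // become the same UTF-8 bytes as a literal curly quote.
    $decoded = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
    return str_replace($chr, $rpl, $decoded);
}

echo normalize_quotes("\xE2\x80\x9CHi\xE2\x80\x9D &lsquo;there&rsquo; &#8220;you&#8221;"), "\n";
```

Note that `html_entity_decode()` also decodes unrelated references such as `&amp;`, so this only suits input that really is HTML-encoded text.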

Here comes the background:

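Every Unicode character belongs to exactly one "General Category", of which the ones that can contain quote characters are the following:

Ps "Punctuation, Open"
Pe "Punctuation, Close"
Pi "Punctuation, Initial quote"
Pf "Punctuation, Final quote"
Po "Punctuation, Other"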

(these pages are handy for checking that you didn't miss anything - there is also an index of categories)

It is sometimes useful to match these categories in a Unicode-enabled regex.
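
As a rough illustration of category matching with PHP's PCRE functions, the sketch below counts characters in the Pi (initial quote) and Pf (final quote) categories. The helper name `count_pi_pf` is made up for this example. It also shows why the categories alone are not enough: Ps holds some quotes, but also ordinary brackets.

```php
<?php
// Hypothetical helper: counts characters in categories Pi and Pf.
function count_pi_pf(string $utf8): int
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $n = preg_match_all('/[\p{Pi}\p{Pf}]/u', $utf8);
    return $n === false ? 0 : $n;
}

// U+201C and U+201D around a word, plus ASCII quotes (category Po, not counted).
echo count_pi_pf("\xE2\x80\x9Cword\xE2\x80\x9D and \"ascii\""), "\n";

// Ps is broader than quotes: it contains "(" as well as U+201A.
var_dump(preg_match('/^\p{Ps}$/u', '(') === 1);
var_dump(preg_match('/^\p{Ps}$/u', "\xE2\x80\x9A") === 1);
```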

Furthermore, Unicode characters have "properties", of which the one you are interested in is Quotation_Mark. Unfortunately, these are not accessible in a regex.

In Wikipedia you can find the group of characters with the Quotation_Mark property. The final reference is PropList.txt on unicode.org, but this is an ASCII textfile.

In case you need to translate CJK characters too, you only have to get their code points, decide their translation, and find their UTF-8 encoding, e.g., by looking it up in fileformat.info (e.g., for U+301E: http://www.fileformat.info/info/unicode/char/301e/index.htm).
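
If looking up the bytes by hand is tedious, PHP 7+ can produce them from the code point with the `"\u{...}"` string escape. The sketch below is only illustrative: the choice of these three CJK marks and their mapping to `"` is an assumption, and `$cjk_map` is a hypothetical name.

```php
<?php
// Illustrative only: which CJK marks to translate, and to what, is up to the caller.
$cjk_map = array(
    "\u{301D}" => '"', // U+301D reversed double prime quotation mark
    "\u{301E}" => '"', // U+301E double prime quotation mark
    "\u{301F}" => '"', // U+301F low double prime quotation mark
);

// Print the UTF-8 bytes of each key in the "\xE3\x80\x9E" style used in the maps above.
foreach (array_keys($cjk_map) as $char) {
    $hex = strtoupper(bin2hex($char));
    echo '"\x' . implode('\x', str_split($hex, 2)) . '"', "\n";
}
```

The entries could then be merged into the main map, for example with `$chr_map + $cjk_map`.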

Regarding Windows codepage 1252: Unicode defines the first 256 code points to represent exactly the same characters as ISO-8859-1. However, ISO-8859-1 is often confused with Windows codepage 1252, so browsers render the range 0x80-0x9F, which is "empty" in ISO-8859-1 (more exactly: it contains control characters), as if it were Windows codepage 1252. The table in the Wikipedia page lists the Unicode equivalents.

Note: strtr() is often slower than str_replace(). Time it with your input and your PHP version. If it's fast enough, you can directly use a map like my $chr_map.
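
One rough way to time the two functions is sketched below. The input is synthetic and the loop counts are arbitrary, so the numbers only hint at what real data would show; the shortened map is there just to keep the example small.

```php
<?php
$chr_map = array(
    "\xE2\x80\x98" => "'", "\xE2\x80\x99" => "'",
    "\xE2\x80\x9C" => '"', "\xE2\x80\x9D" => '"',
);
$chr = array_keys($chr_map);
$rpl = array_values($chr_map);

// Synthetic input; real measurements should use representative data.
$input = str_repeat("\xE2\x80\x9Cquoted\xE2\x80\x9D and \xE2\x80\x98single\xE2\x80\x99 ", 1000);

$t0 = microtime(true);
for ($i = 0; $i < 100; $i++) { $a = strtr($input, $chr_map); }
$t1 = microtime(true);
for ($i = 0; $i < 100; $i++) { $b = str_replace($chr, $rpl, $input); }
$t2 = microtime(true);

printf("strtr:       %.4f s\n", $t1 - $t0);
printf("str_replace: %.4f s\n", $t2 - $t1);
var_dump($a === $b); // checks that both approaches agree for this map
```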

If you are not sure that your input is UTF-8 encoded, AND are willing to assume that if it's not, then it's ISO-8859-1 or Windows codepage 1252, then you can do this before anything else:

if ( !preg_match('/^\\X*$/u', $str)) {
   $str = utf8_encode($str);
}

Warning: this regex can in very rare cases fail to detect a non-UTF-8 encoding, though. E.g.: "Gruß…"/*CP-1252*/=="Gru\xDF\x85" looks like UTF-8 to this regex (U+07C5 is the N'ko digit 5). This regex can be slightly enhanced, but unfortunately it can be shown that there exists NO completely foolproof solution to the problem of encoding detection.
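
A sketch of the same fallback using the mbstring extension, along the lines suggested in the comments below (`utf8_encode()` is deprecated as of PHP 8.2). The helper name `to_utf8` is hypothetical, and the assumption that non-UTF-8 input is Windows-1252 is carried over from above. Unlike `utf8_encode()`, this conversion maps the bytes 0x80-0x9F directly to their regular Unicode code points.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function to_utf8(string $str): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already valid UTF-8, leave untouched
    }
    // Assumption: anything that is not valid UTF-8 is Windows-1252.
    return mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
}

var_dump(to_utf8("Gru\xDF") === "Gru\xC3\x9F");                   // CP-1252 sharp s
var_dump(to_utf8("\x93hi\x94") === "\xE2\x80\x9Chi\xE2\x80\x9D"); // CP-1252 curly quotes
var_dump(to_utf8("Gru\xDF\x85") === "Gru\xDF\x85");               // still mistaken for UTF-8
```

The last line shows that the rare misdetection described in the warning applies here too: those bytes happen to form valid UTF-8, so they are returned unchanged.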

If you want to normalize the range 0x80-0x9F that stems from Windows codepage 1252 to regular Unicode codepoints, you can do this (and remove the first part of the $chr_map above):

$normalization_map = array(
   "\xC2\x80" => "\xE2\x82\xAC", // U+20AC Euro sign
   "\xC2\x82" => "\xE2\x80\x9A", // U+201A single low-9 quotation mark
   "\xC2\x83" => "\xC6\x92",     // U+0192 latin small letter f with hook
   "\xC2\x84" => "\xE2\x80\x9E", // U+201E double low-9 quotation mark
   "\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
   "\xC2\x86" => "\xE2\x80\xA0", // U+2020 dagger
   "\xC2\x87" => "\xE2\x80\xA1", // U+2021 double dagger
   "\xC2\x88" => "\xCB\x86",     // U+02C6 modifier letter circumflex accent
   "\xC2\x89" => "\xE2\x80\xB0", // U+2030 per mille sign
   "\xC2\x8A" => "\xC5\xA0",     // U+0160 latin capital letter s with caron
   "\xC2\x8B" => "\xE2\x80\xB9", // U+2039 single left-pointing angle quotation mark
   "\xC2\x8C" => "\xC5\x92",     // U+0152 latin capital ligature oe
   "\xC2\x8E" => "\xC5\xBD",     // U+017D latin capital letter z with caron
   "\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quotation mark
   "\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quotation mark
   "\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
   "\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
   "\xC2\x95" => "\xE2\x80\xA2", // U+2022 bullet
   "\xC2\x96" => "\xE2\x80\x93", // U+2013 en dash
   "\xC2\x97" => "\xE2\x80\x94", // U+2014 em dash
   "\xC2\x98" => "\xCB\x9C",     // U+02DC small tilde
   "\xC2\x99" => "\xE2\x84\xA2", // U+2122 trade mark sign
   "\xC2\x9A" => "\xC5\xA1",     // U+0161 latin small letter s with caron
   "\xC2\x9B" => "\xE2\x80\xBA", // U+203A single right-pointing angle quotation mark
   "\xC2\x9C" => "\xC5\x93",     // U+0153 latin small ligature oe
   "\xC2\x9E" => "\xC5\xBE",     // U+017E latin small letter z with caron
   "\xC2\x9F" => "\xC5\xB8",     // U+0178 latin capital letter y with diaeresis
);
$chr = array_keys  ($normalization_map); // but: for efficiency you should
$rpl = array_values($normalization_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, $str);
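
The table above could also be generated rather than typed out. The sketch below assumes the mbstring extension is available and skips the five bytes that Windows-1252 leaves undefined; `$generated_map` is a hypothetical name, and the result should be compared against the hand-written map before relying on it.

```php
<?php
// Bytes that Windows-1252 leaves undefined; they are skipped here.
$undefined = array(0x81, 0x8D, 0x8F, 0x90, 0x9D);

$generated_map = array();
for ($byte = 0x80; $byte <= 0x9F; $byte++) {
    if (in_array($byte, $undefined, true)) {
        continue;
    }
    // Key: UTF-8 encoding of the C1 code point U+0080..U+009F.
    // Value: UTF-8 encoding of what the same byte means in Windows-1252.
    $generated_map["\xC2" . chr($byte)] = mb_convert_encoding(chr($byte), 'UTF-8', 'Windows-1252');
}

echo count($generated_map), " entries\n";
echo bin2hex($generated_map["\xC2\x80"]), "\n"; // bytes of the Euro sign
```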

@SebastiánGrignoli, you can read it here: http://www.regular-expressions.info/unicode.html#grapheme As it says there: "You can consider `\X` the Unicode version of the dot". More exactly, it matches UTF-8 non-modifier characters optionally followed by modifier characters, from start (`^`) to end (`$`). I don't know if it also checks the validity of the modifiers for the characters they modify, but for sure it checks that the whole string consists of valid UTF-8 byte sequences (that encode valid Unicode codepoints), and that it does not start with a modifier. — Walter Tross, Feb 06 '14 at 07:17
@SebastiánGrignoli, sorry, I should have said "combining mark" (`\p{M}`) instead of "modifier" — Walter Tross, Feb 06 '14 at 10:34
@WalterTross - thanks very much - I was looking for some out of the box solution, but could not find one. Instead I created a package for this purpose - using part of the above - hope you don't mind. https://github.com/sebastiansulinski/smart-quotes — Sebastian Sulinski, Feb 11 '15 at 11:00
@WalterTross - This really saved me. Thanks for this!! Worked like a charm! — cbloss793, Dec 09 '15 at 17:13
The one and only complete and correct answer to this question on the Web (probably not really but you know what I mean). Too bad it's not ranked higher in relevant searches. — John, Feb 01 '17 at 08:47
Better to use `$str = mb_convert_encoding($str, 'UTF-8', 'Windows-1252');` since the `utf8_encode()` will destroy smart quotes. — Frank Forte, Aug 02 '18 at 17:49
@FrankForte true in general, but if you read carefully, I have written “**before anything else**” — Walter Tross, Aug 02 '18 at 18:26

score 14 · Answer 2 · answered May 25 '18 at 15:19

You can use this one function to convert all characters:

$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);

Be sure to change the source and target encodings to what you need.

(note: this is from another similar question found here).

Lokiare

To be clear, this converts more than just smart quotes, so may have unintended consequences. – John Rix Sep 18 '19 at 16:07
