6

I'd like to pad each multibyte character with spaces on either side. I can strip them out just fine, but I'd like to leave them in and just pad them.

For example: to .

Using underscores to represent spaces: to ______

Ryan
  • *I can strip them out just fine* - please share how. – Wiktor Stribiżew Jan 10 '17 at 21:52
  • Sure Wiktor. Here's how I can strip out each multibyte character: `preg_replace( "/[^a-z0-9 ]/i", "", $string );`. But I don't know how to pad them. – Ryan Jan 10 '17 at 21:54
  • If you can find the character can't you just surround it with a span and add padding/margin? – chris85 Jan 10 '17 at 21:55
  • No, [that does not work](http://ideone.com/tRq85w). – Wiktor Stribiżew Jan 10 '17 at 21:56
  • Correct. You asked me how I can strip out multibyte characters. The comment was an answer to that. I still don't know how to pad spaces around each multibyte character in a string. – Ryan Jan 10 '17 at 21:57
  • Instagram published a regex for all emojis 2 years ago. May help solve your problem: https://engineering.instagram.com/emojineering-part-ii-implementing-hashtag-emoji-7b653b221c82#.gcqbj7109 – Ned Rockson Jan 10 '17 at 22:01

3 Answers

3

Use this monstrous, ready-made regex:

$regex = "[\\x{fe00}-\\x{fe0f}\\x{2712}\\x{2714}\\x{2716}\\x{271d}\\x{2721}\\x{2728}\\x{2733}\\x{2734}\\x{2744}\\x{2747}\\x{274c}\\x{274e}\\x{2753}-\\x{2755}\\x{2757}\\x{2763}\\x{2764}\\x{2795}-\\x{2797}\\x{27a1}\\x{27b0}\\x{27bf}\\x{2934}\\x{2935}\\x{2b05}-\\x{2b07}\\x{2b1b}\\x{2b1c}\\x{2b50}\\x{2b55}\\x{3030}\\x{303d}\\x{1f004}\\x{1f0cf}\\x{1f170}\\x{1f171}\\x{1f17e}\\x{1f17f}\\x{1f18e}\\x{1f191}-\\x{1f19a}\\x{1f201}\\x{1f202}\\x{1f21a}\\x{1f22f}\\x{1f232}-\\x{1f23a}\\x{1f250}\\x{1f251}\\x{1f300}-\\x{1f321}\\x{1f324}-\\x{1f393}\\x{1f396}\\x{1f397}\\x{1f399}-\\x{1f39b}\\x{1f39e}-\\x{1f3f0}\\x{1f3f3}-\\x{1f3f5}\\x{1f3f7}-\\x{1f4fd}\\x{1f4ff}-\\x{1f53d}\\x{1f549}-\\x{1f54e}\\x{1f550}-\\x{1f567}\\x{1f56f}\\x{1f570}\\x{1f573}-\\x{1f579}\\x{1f587}\\x{1f58a}-\\x{1f58d}\\x{1f590}\\x{1f595}\\x{1f596}\\x{1f5a5}\\x{1f5a8}\\x{1f5b1}\\x{1f5b2}\\x{1f5bc}\\x{1f5c2}-\\x{1f5c4}\\x{1f5d1}-\\x{1f5d3}\\x{1f5dc}-\\x{1f5de}\\x{1f5e1}\\x{1f5e3}\\x{1f5ef}\\x{1f5f3}\\x{1f5fa}-\\x{1f64f}\\x{1f680}-\\x{1f6c5}\\x{1f6cb}-\\x{1f6d0}\\x{1f6e0}-\\x{1f6e5}\\x{1f6e9}\\x{1f6eb}\\x{1f6ec}\\x{1f6f0}\\x{1f6f3}\\x{1f910}-\\x{1f918}\\x{1f980}-\\x{1f984}\\x{1f9c0}\\x{3297}\\x{3299}\\x{a9}\\x{ae}\\x{203c}\\x{2049}\\x{2122}\\x{2139}\\x{2194}-\\x{2199}\\x{21a9}\\x{21aa}\\x{231a}\\x{231b}\\x{2328}\\x{2388}\\x{23cf}\\x{23e9}-\\x{23f3}\\x{23f8}-\\x{23fa}\\x{24c2}\\x{25aa}\\x{25ab}\\x{25b6}\\x{25c0}\\x{25fb}-\\x{25fe}\\x{2600}-\\x{2604}\\x{260e}\\x{2611}\\x{2614}\\x{2615}\\x{2618}\\x{261d}\\x{2620}\\x{2622}\\x{2623}\\x{2626}\\x{262a}\\x{262e}\\x{262f}\\x{2638}-\\x{263a}\\x{2648}-\\x{2653}\\x{2660}\\x{2663}\\x{2665}\\x{2666}\\x{2668}\\x{267b}\\x{267f}\\x{2692}-\\x{2694}\\x{2696}\\x{2697}\\x{2699}\\x{269b}\\x{269c}\\x{26a0}\\x{26a1}\\x{26aa}\\x{26ab}\\x{26b0}\\x{26b1}\\x{26bd}\\x{26be}\\x{26c4}\\x{26c5}\\x{26c8}\\x{26ce}\\x{26cf}\\x{26d1}\\x{26d3}\\x{26d4}\\x{26e9}\\x{26ea}\\x{26f0}-\\x{26f5}\\x{26f7}-\\x{26fa}\\x{26fd}\\x{2702}\\x{2705}\\x{2708}-\\x{270d}\\x{270f}]|\\x{23}\\x{20e3}|\\x{2a}\\x{20e3}|\\x{30}\\x{20e3}|\\x{31}\\x{20e3}|\\x{32}\\x{20e3}|\\x{33}\\x{20e3}|\\x{34}\\x{20e3}|\\x{35}\\x{20e3}|\\x{36}\\x{20e3}|\\x{37}\\x{20e3}|\\x{38}\\x{20e3}|\\x{39}\\x{20e3}|\\x{1f1e6}[\\x{1f1e8}-\\x{1f1ec}\\x{1f1ee}\\x{1f1f1}\\x{1f1f2}\\x{1f1f4}\\x{1f1f6}-\\x{1f1fa}\\x{1f1fc}\\x{1f1fd}\\x{1f1ff}]|\\x{1f1e7}[\\x{1f1e6}\\x{1f1e7}\\x{1f1e9}-\\x{1f1ef}\\x{1f1f1}-\\x{1f1f4}\\x{1f1f6}-\\x{1f1f9}\\x{1f1fb}\\x{1f1fc}\\x{1f1fe}\\x{1f1ff}]|\\x{1f1e8}[\\x{1f1e6}\\x{1f1e8}\\x{1f1e9}\\x{1f1eb}-\\x{1f1ee}\\x{1f1f0}-\\x{1f1f5}\\x{1f1f7}\\x{1f1fa}-\\x{1f1ff}]|\\x{1f1e9}[\\x{1f1ea}\\x{1f1ec}\\x{1f1ef}\\x{1f1f0}\\x{1f1f2}\\x{1f1f4}\\x{1f1ff}]|\\x{1f1ea}[\\x{1f1e6}\\x{1f1e8}\\x{1f1ea}\\x{1f1ec}\\x{1f1ed}\\x{1f1f7}-\\x{1f1fa}]|\\x{1f1eb}[\\x{1f1ee}-\\x{1f1f0}\\x{1f1f2}\\x{1f1f4}\\x{1f1f7}]|\\x{1f1ec}[\\x{1f1e6}\\x{1f1e7}\\x{1f1e9}-\\x{1f1ee}\\x{1f1f1}-\\x{1f1f3}\\x{1f1f5}-\\x{1f1fa}\\x{1f1fc}\\x{1f1fe}]|\\x{1f1ed}[\\x{1f1f0}\\x{1f1f2}\\x{1f1f3}\\x{1f1f7}\\x{1f1f9}\\x{1f1fa}]|\\x{1f1ee}[\\x{1f1e8}-\\x{1f1ea}\\x{1f1f1}-\\x{1f1f4}\\x{1f1f6}-\\x{1f1f9}]|\\x{1f1ef}[\\x{1f1ea}\\x{1f1f2}\\x{1f1f4}\\x{1f1f5}]|\\x{1f1f0}[\\x{1f1ea}\\x{1f1ec}-\\x{1f1ee}\\x{1f1f2}\\x{1f1f3}\\x{1f1f5}\\x{1f1f7}\\x{1f1fc}\\x{1f1fe}\\x{1f1ff}]|\\x{1f1f1}[\\x{1f1e6}-\\x{1f1e8}\\x{1f1ee}\\x{1f1f0}\\x{1f1f7}-\\x{1f1fb}\\x{1f1fe}]|\\x{1f1f2}[\\x{1f1e6}\\x{1f1e8}-\\x{1f1ed}\\x{1f1f0}-\\x{1f1ff}]|\\x{1f1f3}[\\x{1f1e6}\\x{1f1e8}\\x{1f1ea}-\\x{1f1ec}\\x{1f1ee}\\x{1f1f1}\\x{1f1f4}\\x{1f1f5}\\x{1f1f7}\\x{1f1fa}\\x{1f1ff}]|\\x{1f1f4}\\x{1f1f2}|\\x{1f1f5}[\\x{1f1e6}\\x{1f1ea}-\\x{1f1ed}\\x{1f1f0}-\\x{1f1f3}\\x{1f1f7}-\\x{1f1f9}\\x{1f1fc}\\x{1f1fe}]|\\x{1f1f6}\\x{1f1e6}|\\x{1f1f7}[\\x{1f1ea}\\x{1f1f4}\\x{1f1f8}\\x{1f1fa}\\x{1f1fc}]|\\x{1f1f8}[\\x{1f1e6}-\\x{1f1ea}\\x{1f1ec}-\\x{1f1f4}\\x{1f1f7}-\\x{1f1f9}\\x{1f1fb}\\x{1f1fd}-\\x{1f1ff}]|\\x{1f1f9}[\\x{1f1e6}\\x{1f1e8}\\x{1f1e9}\\x{1f1eb}-\\x{1f1ed}\\x{1f1ef}-\\x{1f1f4}\\x{1f1f7}\\x{1f1f9}\\x{1f1fb}\\x{1f1fc}\\x{1f1ff}]|\\x{1f1fa}[\\x{1f1e6}\\x{1f1ec}\\x{1f1f2}\\x{1f1f8}\\x{1f1fe}\\x{1f1ff}]|\\x{1f1fb}[\\x{1f1e6}\\x{1f1e8}\\x{1f1ea}\\x{1f1ec}\\x{1f1ee}\\x{1f1f3}\\x{1f1fa}]|\\x{1f1fc}[\\x{1f1eb}\\x{1f1f8}]|\\x{1f1fd}\\x{1f1f0}|\\x{1f1fe}[\\x{1f1ea}\\x{1f1f9}]|\\x{1f1ff}[\\x{1f1e6}\\x{1f1f2}\\x{1f1fc}]";

Inside a preg_replace_callback():

var_dump(preg_replace_callback("@$regex@u", function($match) {
    return $match[0]."  ";
}, ''));

Outputs:

string(18) "      "
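
For comparison, here is a much shorter sketch that puts one space on each side of every match. It assumes a deliberately small, incomplete set of ranges (U+1F300 to U+1FAFF and U+2600 to U+27BF) rather than the full list above, so it will miss flags, keycaps and other sequences. The function name `padEmoji` is made up for illustration.

```php
<?php
// Hypothetical helper: pads each matching code point with one space per side.
function padEmoji(string $text): string
{
    // Assumed, incomplete ranges: pictographs, emoticons, transport symbols,
    // plus Miscellaneous Symbols and Dingbats. Not an exhaustive emoji list.
    $pattern = '/[\x{1F300}-\x{1FAFF}\x{2600}-\x{27BF}]/u';
    // $0 is the whole match; the u flag makes PCRE treat the subject as UTF-8.
    return preg_replace($pattern, ' $0 ', $text);
}

echo padEmoji("a\u{1F600}b"), "\n";
```

Letters such as `ñ` and punctuation such as `'` fall outside these ranges, so they are left untouched.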

revo
  • Love this strategy. But won't it change every time they add new emojis? – Ryan Jan 10 '17 at 23:38
  • This includes Unicode blocks where emojis can be defined. – revo Jan 11 '17 at 15:59
  • Upon further review of all answers, this is the one that best accomplishes the goal of spaces around emojis. Other answers padded all non-alphanumerics (`'`, for example) or split words like `mañana` into `ma ñ ana`. Because @revo was able to isolate emoji Unicode blocks, it all worked out. Thanks! – Ryan Jan 12 '17 at 05:25
1

I found this function that someone had added in the PHP docs that splits a multibyte string into an array of characters (like str_split) and modified it.

function addSpaces($string) {
    // Pass the same encoding to every mb_* call so the result does not
    // depend on mbstring's internal encoding setting.
    $length = mb_strlen($string, "UTF-8");
    $new_string = '';
    for ($i = 0; $i < $length; $i++) {
        $char = mb_substr($string, $i, 1, "UTF-8");
        // strlen() counts bytes, so more than 1 means a multibyte character.
        if (strlen($char) > 1) {
            $new_string .= " $char ";
        } else {
            $new_string .= $char;
        }
    }
    return $new_string;
}

This question has other ways to do that split, and they could be modified in the same way. The modification is: if `strlen()` of a split character is greater than 1, the character is multibyte, so add the spaces.
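
As one example of such an alternative split, here is a sketch that uses `preg_split()` with an empty pattern and the `u` flag instead of repeated `mb_substr()` calls. The name `addSpacesSplit` is hypothetical, and returning the input unchanged on invalid UTF-8 is an assumption about what a caller would want.

```php
<?php
// Hypothetical variant of addSpaces() built on preg_split().
function addSpacesSplit(string $string): string
{
    // An empty pattern with the u flag splits between UTF-8 characters.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_split() fails on malformed UTF-8; assumed fallback: no change.
        return $string;
    }
    $out = '';
    foreach ($chars as $char) {
        // strlen() counts bytes: more than one byte means multibyte.
        $out .= strlen($char) > 1 ? " $char " : $char;
    }
    return $out;
}
```

Like the loop version, this pads every multibyte character, so accented letters are padded as well as emoji.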

Don't Panic
  • This worked. Creative way to get around the solution, even if it took a few more lines to get there. – Ryan Jan 10 '17 at 22:39
  • You're welcome. There's probably a faster way to do it with regex, but that's not really my forte. – Don't Panic Jan 10 '17 at 22:40
1

A simple regex replace could work as well:

mb_regex_encoding("UTF-8");

echo mb_ereg_replace(
    '([^\p{L}\s])',
    ' \\1 ', 
    'text  other text '
);

outputs: text other text


function pad_emojis($string) {
    $default_encoding = mb_regex_encoding();
    mb_regex_encoding("UTF-8");
    $string = mb_ereg_replace('([^\p{L}\s])', ' \\1 ', $string);
    mb_regex_encoding($default_encoding);
    return $string;
}
Ryan
Robert
  • Kudos for simplicity. This might be my favorite. Testing it out now. – Ryan Jan 10 '17 at 23:40
  • Confirmed. Works as advertised. Thanks so much. – Ryan Jan 11 '17 at 01:10
  • Yes, and it will apply padding on many many more characters like punctuation marks. @Ryan – revo Jan 11 '17 at 16:00
  • Hmm.. @revo is right.. Depending on your usage you could also strip out the whole range like this: `mb_ereg_replace('([^\x{0000}-\x{FFFF}])', ...` so you're sure you only match multibyte chars. – Robert Jan 11 '17 at 16:59
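
The range idea in the last comment can be sketched as follows, swapping in `preg_replace()` with the `u` flag in place of `mb_ereg_replace()`. Note that excluding U+0000 to U+FFFF leaves only code points above U+FFFF (four bytes in UTF-8), which is narrower than "multibyte": `ñ` and `'` are left alone, but so are emoji inside the Basic Multilingual Plane such as U+2764. The name `padAstral` is hypothetical.

```php
<?php
// Hypothetical helper: pads only code points above U+FFFF.
function padAstral(string $text): string
{
    // U+10000 to U+10FFFF covers the supplementary planes, where most
    // (but not all) emoji live.
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', ' $0 ', $text);
}

echo padAstral("ma\u{F1}ana \u{1F600}"), "\n";
```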