Match and replace emoticons in string - what is the most efficient way?

Question

Wikipedia defines a lot of possible emoticons people can use. I want to match this list to words in a string. I now have this:

$string = "Lorem ipsum :-) dolor :-| samet";
$emoticons = array(
  '[HAPPY]' => array(' :-) ', ' :) ', ' :o) '), //etc...
  '[SAD]'   => array(' :-( ', ' :( ', ' :-| ')
);
foreach ($emoticons as $emotion => $icons) {
  $string = str_replace($icons, " $emotion ", $string);
}
echo $string;

Output:

Lorem ipsum [HAPPY] dolor [SAD] samet

so in principle this works. However, I have two questions:

As you can see, I'm putting spaces around each emoticon in the array, such as ' :-) ' instead of ':-)' This makes the array less readable in my opinion. Is there a way to store emoticons without the spaces, but still match against $string with spaces around them? (and as efficiently as the code is now?)
Or is there perhaps a way to put the emoticons in one variable, and explode on space to check against $string? Something like

$emoticons = array( '[HAPPY]' => ">:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^)", '[SAD]' => ":'-( :'( :'-) :')" //etc...
Is str_replace the most efficient way of doing this?

I'm asking because I need to check millions of strings, so I'm looking for the most efficient way to save processing time :)

Shouldn’t you also deal with trans-ASCII emotica? The web is over 80% Unicode now, you know. There is an entire Unicode block devoted to such things: Blk=Emoticons. But some occur elsewhere, too. — tchrist, Feb 15 '12 at 15:45
@Li-aungYip Heh, that’s a good one! No, I mean code points like U+1F609 `WINKING FACE` and U+263A `WHITE SMILING FACE` ☺. Most of them are in the Emotions block (like the first of those two above), with just a few in the legacy BMP. — tchrist, Feb 15 '12 at 15:59
@tchrist No, I am only concerned with western emoticons as defined on http://en.wikipedia.org/wiki/List_of_emoticons, but thanks for the input :) — Pr0no, Feb 15 '12 at 16:11
Unfortunately the default Ubuntu fonts don't include all of the code points from that block yet, so I get U+263A fine but U+1F609 is a box. I wonder how the support is on Win7? ;) — Li-aung Yip, Feb 15 '12 at 16:16
No winking face on Win7 here. No suitable fallback font, is the problem. — Kawa, Feb 15 '12 at 16:20

tchrist · Answer 1 · 2012-02-15T16:31:29.170

Here’s the idea using the Perl 3rd-party Regexp::Assemble module from CPAN. For example, given this program:

#!/usr/bin/env perl
use utf8;
use strict;
use warnings;

use Regexp::Assemble;

my %faces = (
    HAPPY => [qw¡ :-) :) :o) :-} ;-} :-> ;-} ¡],
    SAD   => [qw¡ :-( :( :-| ;-) ;-( ;-< |-{ ¡],
);

for my $name (sort keys %faces) {
    my $ra = Regexp::Assemble->new();
    for my $face (@{ $faces{$name} }) {
        $ra->add(quotemeta($face));
    }
    printf "%-12s => %s\n", "[$name]", $ra->re;
}

It will output this:

[HAPPY]      => (?-xism:(?::(?:-(?:[)>]|\})|o?\))|;-\}))
[SAD]        => (?-xism:(?::(?:-(?:\||\()|\()|;-[()<]|\|-\{))

There’s a bit of extra stuff there you don’t really probably need, so those would reduce to just:

[HAPPY]      => (?:-(?:[)>]|\})|o?\))|;-\}
[SAD]        => (?:-(?:\||\()|\()|;-[()<]|\|-\{

or so. You could build that into your Perl program to trim the extra bits. Then you could place the righthand sides straight into your preg_replace.

The reason I did the use utf8 was so I could use ¡ as my qw// delimiter, because I didn’t want to mess with escaping things inside there.

You wouldn’t need to do this if the whole program were in Perl, because modern versions of Perl already know to do this for you automatically. But it’s still useful to know how to use the module so you can generate patterns to use in other languages.

@Li-aungYip There’s a lot more where that comes from; you have to bear in mind [whom you’re talking to](http://shop.oreilly.com/product/9780596004927.do), you know. — tchrist, Feb 15 '12 at 16:30
OH, SHI... (At least you're not the author of Mastering Regular Expressions. And now I will have to keep an eye out for a `jfriedl` here on SO...) — Li-aung Yip, Feb 15 '12 at 16:43
@Li-aungYip For real regex mastery, you need modern pattern stuff, which Jeffrey’s MRE doesn’t cover — yet. See [this answer](http://stackoverflow.com/a/9297894/471272) for the sort of thing I mean: named groups (and more flexible ones than in Python obtain), recursive patterns, and grammatical patterns. — tchrist, Feb 15 '12 at 17:19

Li-aung Yip · Answer 2 · 2012-02-15T15:40:45.877

This sounds like a good application for regular expressions, which are a tool for fuzzy text matching and replacement. str_replace is a tool for exact text search and replace; regexps will let you search for entire classes of "text that looks something like this", where the this is defined in terms of what kinds of characters you will accept, how many of them, in what order, etc.

If you use regular expressions, then...

The \s wildcard will match whitespace, so you can match \s$emotion\s.

(Also consider the case where the emoticon occurs at the end of a string - i.e. that was funny lol :) - you can't always assume emoticons will have spaces around them. You can write a regexp that handles this.)
You can write a regular expression that will match any of the emoticons in the list. You do this using the alternation symbol |, which you can read as an OR symbol. The syntax is (a|b|c) to match pattern a OR b OR c.

For example (:\)|:-\)|:o\)) will match any of :),:-),:o). Note that I had to escape the )'s because they have a special meaning inside regexps (parentheses are used as a grouping operator.)
Premature optimisation is the root of all evil.

Try the most obvious thing first. If that doesn't work, you can optimise it later (after you profile the code to ensure this is really going to give you a tangible performance benefit.)

If you want to learn regular expressions, try Chapter 8 of the TextWrangler manual. It's a very accessible introduction to the uses and syntax of regular expressions.

Note: my advice is programming-language independent. My PHP-fu is much weaker than my Python-fu, so I can't provide sample code. :(

Diego · Answer 3 · 2012-02-15T15:45:37.150

2

I would start trying out the simplest implementation first, using str_replace and those arrays with spaces. If the performance is unacceptable, try a single regular expression per emotion. That compresses things quite a bit:

$emoticons = array(
  '[HAPPY]' => ' [:=]-?[\)\]] ', 
  '[SAD]'   => ' [:=]-?[\(\[\|] '
);

If performance is still unacceptable, you can use something fancier, like a suffix tree (see: http://en.wikipedia.org/wiki/Suffix_tree ), which allows you to scan the string only once for all emoticons. The concept is simple, you have a tree whose root is a space (since you want to match a space before the emoticon), the first children are ':' and '=', then children of ':' are ']', ')', '-', etc. You have a single loop that scans the string, char by char. When you find a space, you move to the next level in the tree, then see if the next character is one of the nodes at that level (':' or '='), if so, move to the next level, etc. If, at any point, the current char is not a node in the current level, you go back to root.

edited Feb 15 '12 at 15:45

answered Feb 15 '12 at 15:26

Diego

18,035
5
62
66

A suffix tree/finite state machine would be a very elegant solution. Kudos. (but in this case, wouldn't it be a prefix tree? ;) ) – Li-aung Yip Feb 15 '12 at 15:34
No, it's a suffix tree. The wikipedia page shows the suffix tree for the word `BANANA$` where "*The six paths from the root to a leaf (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$*" – Diego Feb 15 '12 at 15:40
Computer science, giving things unintuitive names ever since `dynamic programming` (not really a type of programming.) – Li-aung Yip Feb 15 '12 at 15:42
You want to use the Perl [Regexp::Assemble](http://search.cpan.org/perldoc?Regexp::Assemble) module to run an analysis on the set of patterns to create a prefix/suffix-tree representation as a regex. Then you can plug that resulting optimized regex into any programming language. This is especially useful for languages that aren’t smart enough to use a TRIE representation the way Perl is. – tchrist Feb 15 '12 at 15:48
@tchrist: will boiling it down to a regex let you distinguish which kind of emoticon you're matching? i.e. can you still replace `:)` with `[HAPPY]` and `:(` with `[SAD]` using one `preg_replace()`? – Li-aung Yip Feb 15 '12 at 15:50
@Li-aungYip It would, assuming you assembled one set of them for the happy ones and another for the sad ones. You would then feed the result into `preg_replace()`. Diego’s answer is pretty good, actually. The advantage of the `Regexp::Assemble` module is that it is provably optimal in its assembly, given a whole bunch of possible input patterns. It’s way smarter than just a bunch of or-pipes. Gimme a seconds and I’ll whip something up for you to show you what I mean. – tchrist Feb 15 '12 at 16:02
Provably optimal in the sense of producing the smallest possible finite state machine? – Li-aung Yip Feb 15 '12 at 16:17

score 2 · Answer 4 · edited Jun 20 '20 at 09:12

Intro Comment: Please only ask one question at once. You'll get better answers than. Next to that, you can't get good performance advice if you don't show us the metrics you've done so far.

From what I can see from your code is that you do two times a string processing you could save, putting the replacement into spaces in specific. You could unroll it with your definition first:

$emoticons = array(
  ' [HAPPY] ' => array(' :-) ', ' :) ', ' :o) '), //etc...
  ' [SAD] '   => array(' :-( ', ' :( ', ' :-| ')
);

foreach ($emoticons as $replace => $search)
{
  $string = str_replace($search, $replace, $string);
}

This will save you some fractions of a microsecond each time you call that which, well give you better performance you'll probably not notice. Which brings me to the point that you should probably write this in C and compile it.

A bit closer to C would be using a regular expression compiled once and then re-used, which has been suggested in another answer already. The benefit here is that you might have the fastest way you can do it with PHP if you run the same expression multiple times and you could generate the regular expression upfront, so you can store it in a format that is easier for you to edit. You could then cache the regular expression in case you would need to even need to tweak performance that hardly.

1. As you can see, I'm putting spaces around each emoticon in the array, such as ' :-) ' instead of ':-)' This makes the array less readable in my opinion. Is there a way to store emoticons without the spaces, but still match against $string with spaces around them? (and as efficiently as the code is now?)

Yes this is possible but not more efficiently in the sense that you would need to further process the configuration data into the replacement data. No idea about which kind of efficiency you really talk, but I assume the later, so the answer is, possible but not suitable for your very special use-case. Normally I would prefer something that's easier to edit, so to say you're more efficient to deal with it instead of caring about processing speed, because processing speed can be fairly well shorten by distributing the processing across multiple computers.

2. Or is there perhaps a way to put the emoticons in one variable, and explode on space to check against $string? Something like

$emoticons = array( '[HAPPY]' => ">:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^)", '[SAD]' => ":'-( :'( :'-) :')" //etc...

Sure, that's possible but you run into the same issues as in 1.

3. Is str_replace the most efficient way of doing this?

Well right now with the code you've offered it's the only way you ask about. As there is no alternative you tell us about, it's at least working for you which at this point in time is the most efficient way of doing that for you. So right now, yes.

Surely you're not expecting to bake compiled C code into a PHP application? Doable, sure, but not for beginners or anyone who wants to maintain their sanity. — Li-aung Yip, Feb 15 '12 at 15:45
Actually PHP is an interface to C compiled functions. As OP is asking for performance, I don't see that suggestion being far-off. However I did not suggest to bake C into a PHP application but suggested that he should do everything in C instead if performance is critical. But that's only one very little point in the answer, and I outlined the alternative with regex here if OP want's to stay in PHP (as you did as well). — hakre, Feb 15 '12 at 16:14
My programming experience now goes so far as some php - I wouldn't think of writing a compiled bit for this purpose. Let's say I want to optimize for performance within php's scripting environment :) But thanks for the suggestion! — Pr0no, Feb 15 '12 at 16:33
If you use the same regular expression pattern multiple times within the same script execution, I guess that `preg_replace` turns out to be fastest in your case. But you need to metric that so you know. — hakre, Feb 15 '12 at 16:41

score 2 · Accepted Answer · answered Feb 15 '12 at 16:20

If the $string, in which you want replace emoticons, is provided by a visitor of your site(I mean it's a user's input like comment or something), then you should not relay that there will be a space before or after the emoticon. Also there are at least couple of emoticons, that are very similar but different, like :-) and :-)). So I think that you will achieve better result if you define your emoticon's array like this:

$emoticons = array(
    ':-)' => '[HAPPY]',
    ':)' => '[HAPPY]',
    ':o)' => '[HAPPY]',
    ':-(' => '[SAD]',
    ':(' => '[SAD]',
    ...
)

And when you fill all find/replace definitions, you should reorder this array in a way, that there will be no chance to replace :-)) with :-). I believe if you sort array values by length will be enough. This is in case your are going to use str_replace(). strtr() will do this sort by length automatically!

If you are concerned about performance, you can check strtr vs str_replace, but I will suggest to make your own testing (you may get different result regarding your $string length and find/replace definitions).

The easiest way will be if your "find definitions" doesn't contain trailing spaces:

$string = strtr( $string, $emoticons );
$emoticons = str_replace( '][', '', trim( join( array_unique( $emoticons ) ), '[]' ) );
$string = preg_replace( '/\s*\[(' . join( '|', $emoticons ) . ')\]\s*/', '[$1]', $string ); // striping white spaces around word-styled emoticons

Turns out that very happy peoples use this. Found it in provided Wikipedia list, that seems @Reveller use as reference. I wasn't aware about that emoticon too, may be because I express even my top emotions just with :) — Boris Belenski, Feb 15 '12 at 16:47

Match and replace emoticons in string - what is the most efficient way?

5 Answers5

Linked