4

I am trying to replace emoji in Arabic tweets using Java.

I used this code:

String line = "اييه تقولي اجل الارسنال تعادل امس بعد ما كان فايز ";
Pattern unicodeOutliers = Pattern.compile("([\u1F601-\u1F64F])", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(line);
line = unicodeOutlierMatcher.replaceAll(" $1 ");

But it does not replace them. Even when I match only the character itself, "\u1F602", it is not replaced. Maybe this is because there are 5 digits after the \u? I am not sure; it is just a guess.

Note that:

1- the emoticon at the end of the tweet (😂) is U+1F602, which is "face with tears of joy"

2- this question is not a duplicate of this question.

Any ideas?

Daisy
  • possible duplicate of [What is the regex to extract all the emojis from a string?](http://stackoverflow.com/questions/24840667/what-is-the-regex-to-extract-all-the-emojis-from-a-string) – Karol S Nov 09 '14 at 00:33
  • What happens if you omit the space in the middle of the regular expression? – Dawood ibn Kareem Nov 09 '14 at 02:53
  • @David Wallace: It was a writing mistake, the original code without the space. – Daisy Nov 09 '14 at 11:27
  • @KarolS: The advice in the suggested URL did not help. – Daisy Nov 09 '14 at 11:28
  • Oh, I've found the answer, Daisy. It's the 5-digit Unicode thing, as you suspected. The JLS says that Unicode escape sequences must be 4 digits. – Dawood ibn Kareem Nov 09 '14 at 11:38
  • @DavidWallace: Thanks, Take your time, and I will try to know more about this "surrogate pair". – Daisy Nov 09 '14 at 12:55

2 Answers

5

From the Javadoc for the Pattern class:

A Unicode character can also be represented in a regular-expression by using its Hex notation(hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.

This means that the regular expression that you're looking for is ([\x{1F601}-\x{1F64F}]). Of course, when you write this as a Java String literal, you must escape the backslashes.

Pattern unicodeOutliers = Pattern.compile("([\\x{1F601}-\\x{1F64F}])");

Note that the construct \x{...} is only available from Java 7.
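As a minimal sketch of the complete fix, assuming Java 7 or later: the class name `EmojiSpacer`, the helper `padEmoji`, and the sample input are made up for illustration and are not from the question. The last line of `main` also shows why the original literal fails: the compiler reads only four hex digits after the `u`, so the string ends up holding U+1F60 followed by the digit 2.

```java
import java.util.regex.Pattern;

class EmojiSpacer {
    // Regex-level \x{...} escapes name the code points directly (Java 7+).
    private static final Pattern EMOTICONS =
            Pattern.compile("([\\x{1F601}-\\x{1F64F}])");

    // Hypothetical helper: surrounds every matched emoticon with spaces.
    static String padEmoji(String text) {
        return EMOTICONS.matcher(text).replaceAll(" $1 ");
    }

    public static void main(String[] args) {
        // Build U+1F602 from its code point so the source file stays ASCII.
        String joy = new String(Character.toChars(0x1F602));
        String padded = padEmoji("goal" + joy + "!");
        System.out.println(padded.equals("goal " + joy + " !")); // prints true

        // The literal from the question is two chars, not one emoji.
        System.out.println("\u1F602".length()); // prints 2
    }
}
```

Building the test character with `Character.toChars` avoids depending on the encoding of the source file.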

nhahtdh
Dawood ibn Kareem
  • I'm not really bothered with reputation. If you are concerned about the responsibility of maintaining the technical correctness of the part which I added in, please ping me again and I will split into my own answer. – nhahtdh Nov 10 '14 at 06:48
  • Done as requested. I took most of my edit out, except for some stylistic edits and the version information. – nhahtdh Nov 10 '14 at 07:37
5

Java 5 and 6

If you are stuck running your program on a Java 5 or 6 JVM and you want to match characters in the range from U+1F601 to U+1F64F, use surrogate pairs in the character class:

Pattern emoticons = Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

This method remains valid in Java 7 and above: in Sun/Oracle's implementation (as decompiling the Pattern.compile() method shows), the String containing the pattern is converted into an array of code points before compilation, so each surrogate pair becomes a single code point.
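To see where the surrogate pairs in that pattern come from, the following sketch derives them from the code points and checks the character class against a few sample characters. The class name `SurrogatePairDemo` and the helper `toEscapes` are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

class SurrogatePairDemo {
    // Source-level escapes: the compiler places the raw surrogate pairs in the string.
    static final Pattern EMOTICONS =
            Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    // Hypothetical helper: formats a code point as its UTF-16 code unit escapes.
    static String toEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    // Hypothetical helper: true if the single code point is in the character class.
    static boolean inRange(int codePoint) {
        return EMOTICONS.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // Both ends of the range share the same high surrogate.
        System.out.println(toEscapes(0x1F601));
        System.out.println(toEscapes(0x1F64F));

        System.out.println(inRange(0x1F602)); // inside the range
        System.out.println(inRange(0x1F600)); // just below the range
    }
}
```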

Java 7 and above

  1. You can use the construct \x{...} in David Wallace's answer, which is available from Java 7.

  2. Alternatively, you can specify the whole Emoticons Unicode block, which spans from code point U+1F600 (instead of U+1F601) to U+1F64F.

    Pattern emoticons = Pattern.compile("\\p{InEmoticons}");
    

    Since support for the Emoticons block was added in Java 7, this method is also only valid from Java 7.

  3. Although the other methods are preferred, you can also write the surrogate pair as regex-level escapes, so that the regex engine rather than the Java compiler interprets them. There is no reason to do this in source code, but this change in Java 7 corrects the behavior of applications where a regex is used for searching and directly pasting the character is not possible.

    Pattern emoticons = Pattern.compile("[\\uD83D\\uDE01-\\uD83D\\uDE4F]");
    

    /!\ Warning

    Never mix the two syntaxes when you specify a supplementary code point, as in:

    • "[\\uD83D\uDE01-\\uD83D\\uDE4F]"

    • "[\uD83D\\uDE01-\\uD83D\\uDE4F]"

    In Oracle's implementation, both of those match the code point U+D83D plus the range from code point U+DE01 to code point U+1F64F.
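The three Java 7+ options in the list above can be compared side by side. This is only a sketch, assuming Java 7 or later; the class name `EmoticonPatterns` and the helper `matches` are hypothetical. The mixed forms from the warning are deliberately left out.

```java
import java.util.regex.Pattern;

class EmoticonPatterns {
    // Option 1: regex-level hex escapes for the code points.
    static final Pattern HEX = Pattern.compile("[\\x{1F601}-\\x{1F64F}]");
    // Option 2: the whole Emoticons block, which also covers U+1F600.
    static final Pattern BLOCK = Pattern.compile("\\p{InEmoticons}");
    // Option 3: regex-level surrogate escapes, all four written the same way.
    static final Pattern ESCAPED =
            Pattern.compile("[\\uD83D\\uDE01-\\uD83D\\uDE4F]");

    // Hypothetical helper: does the pattern match this single code point?
    static boolean matches(Pattern p, int codePoint) {
        return p.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        int[] samples = {0x1F600, 0x1F602, 0x1F64F, 'a'};
        for (int cp : samples) {
            System.out.printf("U+%04X hex=%b block=%b escaped=%b%n",
                    cp, matches(HEX, cp), matches(BLOCK, cp), matches(ESCAPED, cp));
        }
    }
}
```

Only the block-based pattern should accept U+1F600, because the explicit ranges start at U+1F601.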

Note

In Oracle's implementation for Java 5 and 6, Pattern.u() doesn't collapse a valid regex-escaped surrogate pair such as "\\uD83D\\uDE01" into a single code point. As a result, the pattern is interpreted as 2 lone surrogates, which will fail to match anything.

nhahtdh
  • Are you sure that the surrogate pairs are handled as single characters in the pre-7 solution? I would have expected the required regexp to be `\uD83D[\uDE01-\uDE4F]`, since in most cases Java treats astral characters as pairs of characters. – Dawood ibn Kareem Nov 10 '14 at 08:07
  • @DavidWallace: `\uD83D[\uDE01-\uDE4F]` is the wrong way to do this, since it will try to match surrogate, but the string has been transformed to code point before matching begins (this is in Oracle's implementation). I'm not sure about Java 5, but Java 6u20 does support this. – nhahtdh Nov 10 '14 at 08:22
  • Really? I think I'm going to have to run some tests of my own. I don't quite 100% believe you yet. But I guess that's what testing is for. – Dawood ibn Kareem Nov 10 '14 at 08:29
  • @DavidWallace: I verified the structure of the compiled regex in Oracle's implementation (just a bunch of objects chained together), so there is no space left for doubt. If you manage to specify the regex so that a valid surrogate pair is interpreted as 2 lone surrogates in the regex, then no string can match the regex (like your regex, or the regex I put up as example). – nhahtdh Nov 10 '14 at 08:33