
I used this post to make a regex that finds emojis in a string of text and simply puts some space characters on either side of each one. My regex code:

try:
    # Wide UCS-4 build
    oRes = re.compile(u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+', 
        re.UNICODE)
except re.error:
    # Narrow UCS-2 build
    oRes = re.compile(u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+', 
        re.UNICODE)

s2 = oRes.sub(r'  \1  ', s1)

However, I am getting some really odd behaviour where emojis are being removed, as in the example below. Any advice would be appreciated. I am using Python on a MacBook. Thanks.

INPUT

هيلاري كلينتون "متنحة" وتشير إلى عملية غش في ولاية بانسيلفانيا العتيقة قائلة: "عند فرز الاصوات ..قطعوا الكهربا ✋" #ابو_الياس

OUTPUT

هيلاري كلينتون "متنحة" وتشير إلى عملية غش في ولاية بانسيلفانيا العتيقة قائلة: "عند فرز الاصوات ..قطعوا الكهربا ✋ " #ابو_الياس

Dr T
  • What version of python are you using? – timotree Dec 05 '16 at 17:27
  • Thanks for your reply, I am using 2.7. – Dr T Dec 05 '16 at 18:02
  • You're welcome. I don't know much about unicode in python though so someone else will have to answer your question. – timotree Dec 05 '16 at 19:23
  • Which of the two branches is executing on your system and causing the problem? If `len(u'\U0001f600')` returns 2 then you are using UCS2, if it returns 1 it's UCS4. – Meyer Dec 05 '16 at 19:58
  • @SMeyer the OP linked to a post where someone already helped them figure this out. "wow, thanks! It seems the USC-4 build works properly!"(from op) – timotree Dec 05 '16 at 20:03
  • just for completeness, len(u'\U0001f600') returns 2. – Dr T Dec 05 '16 at 21:38
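The build check suggested in the comments can be sketched as below. This is only an illustration: `is_narrow_build` is a hypothetical helper name, and on Python 3.3 and later the interpreter always behaves like a wide build, so the narrow result is only expected on some Python 2 installations.

```python
import sys

def is_narrow_build():
    """Return True on a narrow (UCS-2) build, False on a wide (UCS-4) one.

    Hypothetical helper: sys.maxunicode is 0xFFFF on narrow builds and
    0x10FFFF on wide builds (and on every Python 3.3+ interpreter).
    """
    return sys.maxunicode == 0xFFFF

# The len() test from the comments: a character outside the BMP is stored
# as two surrogate code units on a narrow build, and as one on a wide build.
astral_length = len(u'\U0001f600')

print(is_narrow_build(), astral_length)
```

Both checks should agree: a length of 2 goes with a narrow build, a length of 1 with a wide one.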

1 Answer

The following works for me once I correct the placement of the round brackets in your regular expressions. In the try block, you need round brackets around the whole thing if you want to create the group \1 at all; in the except block, the round brackets need to include the +, otherwise the \1 group will only capture the last of multiple relevant characters (a repeated group keeps only its final repetition, so the earlier characters in a run are dropped by the substitution).

import re
with open('input.txt', 'rb') as f:
    s1 = f.read().decode('utf-8').strip()

try:
    # Wide UCS-4 build
    oRes = re.compile(u'(['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+)', 
        re.UNICODE)
except re.error:
    # Narrow UCS-2 build
    oRes = re.compile(u'(('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+)', 
        re.UNICODE)

s2 = oRes.sub(r'  \1  ', s1)

with open('output.txt', 'wb') as f:
    f.write((s1+'\n').encode('utf-8'))
    f.write((s2+'\n').encode('utf-8'))
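To see why the position of the `+` matters, here is a small sketch that uses only the BMP symbol ranges, so it should behave the same on narrow and wide builds. The names `SYMBOLS`, `plus_outside`, and `plus_inside` are made up for this illustration. When the `+` sits outside the group, a run of adjacent symbols is matched as a whole, but `\1` holds just one of them (in Python's `re`, the one from the final repetition), so the rest of the run disappears from the output.

```python
import re

# Character class covering the two BMP symbol blocks from the question.
SYMBOLS = u'[\u2600-\u26FF\u2700-\u27BF]'

# Two adjacent symbols (U+270B, U+2600) between plain letters.
text = u'x\u270b\u2600y'

# '+' outside the group: the whole run is matched, but group 1 keeps
# only a single repetition of it.
plus_outside = re.compile(u'(' + SYMBOLS + u')+', re.UNICODE)

# '+' inside the group: group 1 spans the whole run.
plus_inside = re.compile(u'(' + SYMBOLS + u'+)', re.UNICODE)

dropped = plus_outside.sub(u'  \\1  ', text)  # one symbol is lost
kept = plus_inside.sub(u'  \\1  ', text)      # both symbols survive

print(repr(dropped))
print(repr(kept))
```

The first result contains only one of the two symbols, which matches the "emojis are being removed" symptom in the question.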

As for the reversal of your characters, that must be an artifact of some step in your input/output or copy/paste chain not correctly handling the right-to-left nature of Arabic. It doesn't happen for me. The results look good when I open output.txt in TextWrangler on my MacBook.

jez
  • Good catch. I missed that the `+` wasn't in the capture. – timotree Dec 05 '16 at 20:04
  • I think this is incorrect in the except block. Note that the brackets include OR statements (|), which means that in the current code you only apply the plus to the last OR case. – Meyer Dec 05 '16 at 20:07
  • Hmm, @SMeyer is right. Actually to surround *each* emoji character with spaces (if that's the aim) requires the `+` to be removed. To surround each emoji *sequence* with spaces requires extra parentheses. I'll put the extra ones in because the `try` block suggests that surrounding whole sequences is the aim. – jez Dec 05 '16 at 20:10
  • the aim is to surround **each** emoji - so that means the + should be removed i guess. – Dr T Dec 05 '16 at 21:27
  • "As for the reversal of your characters" ??? sorry, I did not mention any reversal? – Dr T Dec 05 '16 at 21:28
  • Your INPUT and OUTPUT look like the characters might have reversed character order relative to each other. The emoji is near the beginning and preceded by `#"` in the output, whereas it's near the end and *followed* by `"#` in the input. The arabic is sometimes the same, sometimes slightly different, which I presume is the result of characters being reversed and then differently grouped into words. – jez Dec 05 '16 at 21:38
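If the aim is to pad each emoji individually, as stated in the comments, dropping the `+` is enough. The following is a minimal sketch rather than a drop-in fix: it assumes Python 3 or a wide Python 2 build (the `\U` ranges in a character class fail to compile on a narrow build), and `EACH_EMOJI` and `pad_each_emoji` are hypothetical names.

```python
import re

# Same ranges as the wide-build branch, but without '+', so every
# emoji is matched and padded on its own.
EACH_EMOJI = re.compile(
    u'(['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF])',
    re.UNICODE)

def pad_each_emoji(text):
    """Surround every single emoji in text with two spaces on each side."""
    return EACH_EMOJI.sub(u'  \\1  ', text)

print(repr(pad_each_emoji(u'a\u270b\u2600b')))
```

Adjacent emojis end up separated by four spaces, two from each neighbour; collapsing repeated spaces afterwards would be a separate step.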