How to remove all instances of any emoji from a twitter post in Python 2.7

Question

I'm working on a project that pulls tweets from Twitter using tweepy and processes the text. The problem is that I can't have any emoji, special characters, etc. in the text. Unfortunately, one of the libraries I am using doesn't support Python 3, so I have to use Python 2.7. Is there any way to remove everything except the "human readable text"? I have been using the ftfy library, but I still get output like this:

∩┐╜∩┐╜
φï░φîî∞▒ù
ï¿½ï¿½

My code:

import tweepy
from ftfy import fix_text,fix_encoding
from requests.exceptions import ConnectionError
from requests.packages.urllib3.exceptions import ProtocolError,ReadTimeoutError
import time

consumer_key = '...'
consumer_secret = '...'

access_token = '...'
access_token_secret = '...'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)



class MyStreamListener(tweepy.StreamListener):
    def on_connect(self):
        print 'Connected'
    def on_status(self, status):
        fixed_text = fix_text(fix_encoding(status.text)).encode('utf-8')
        print fixed_text
        return True
    def on_error(self, status):
        print status
        return False


running = True
while running:
    try:
        print 'Connecting'
        myStreamListener = MyStreamListener()
        myStream = tweepy.Stream(auth=auth,listener=myStreamListener)
        myStream.filter(track=['python'])
    except ConnectionError:
        print 'Connection Error: Waiting 10 seconds before retrying'
        time.sleep(10)
    except ProtocolError:
        print 'ProtocolError: Waiting 10 seconds before retrying'
        time.sleep(10)
    except ReadTimeoutError:
        print 'Read Timeout Error: Waiting 10 seconds before retrying'
        time.sleep(10)

Note: this is just my test script for learning how to pull tweets from Twitter and print them.

What do you consider to be a "special character" or "human readable text"? – 一二三 Apr 18 '16 at 12:37
Well, the text from these tweets is going to be used in a natural language processing program that I'm starting. I don't mind special characters like 'è'. – Justin6533 Apr 18 '16 at 14:23
I need to get rid of all the characters that are not "natural language", like emoji. – Justin6533 Apr 18 '16 at 14:38

score 0 · Answer 1 · answered Apr 18 '16 at 10:00

If you are getting an error along the lines of "bad character range", the code below should work:

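import re

try:
    # Wide (UCS-4) build: characters above U+FFFF can be matched directly.
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # Narrow (UCS-2) build: the same characters are stored as surrogate pairs.
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')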
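A minimal sketch of how such a pattern might be applied to tweet text. The helper name strip_astral and the sample string are made up for illustration, and the backslash escapes are written out in full. It is written to run on both Python 2.7 and Python 3.

```python
import re

try:
    # Wide (UCS-4) build, or any Python 3: match astral characters directly.
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # Narrow (UCS-2) build: match a high surrogate followed by a low surrogate.
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')


def strip_astral(text):
    """Remove every character above U+FFFF from a unicode string.

    Hypothetical helper, not part of tweepy or ftfy.
    """
    return highpoints.sub(u'', text)


# Illustrative sample: two emoji above U+FFFF plus an accented letter.
sample = u'I love Python \U0001F40D\U0001F600 caf\xe8'
print(repr(strip_astral(sample)))
```

Note that this only removes characters above U+FFFF. Accented letters such as 'è' are kept, but so are the symbols that live in the Basic Multilingual Plane (for example U+2764, the heavy black heart), so a stricter filter would need those ranges added explicitly.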

The Boat

@Rowand Adair: It gave me some weird output. I tried something like this: [link](http://stackoverflow.com/questions/26568722/remove-unicode-emoji-using-re-in-python?rq=1) but it still had emoji in the output. – Justin6533 Apr 18 '16 at 10:52
Check the character range for emoji and implement that, specifically the emoji character range for "Tweepy". – The Boat Apr 18 '16 at 11:46
@Rowand Adair: OK, so I found a list of all the possible emoji in Unicode, but I have a narrow (UCS-2) build of Python, and I believe the code points I found are UCS-4 (I'm still slightly confused about some of this). How would I go about converting the code points I found into something my version of Python can use? – Justin6533 Apr 18 '16 at 14:31
By the way, I'm on Windows 10 64-bit. – Justin6533 Apr 18 '16 at 14:35
Sorry for taking so long to reply, Justin. The code I have should work, but I will try to get your code working. Could you also add a link to any external software you're using along with this, and tell me which build of Python you're using? I am positive my system has a different version installed. Thank you, and apologies. – The Boat Apr 21 '16 at 11:07
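
On the narrow-build question raised in the comments above: a narrow build stores each code point above U+FFFF as a UTF-16 surrogate pair, so a code point from an emoji list can be translated with a little arithmetic. The sketch below shows that conversion; the function name to_surrogate_pair is hypothetical, and it assumes the input is in the range U+10000 to U+10FFFF.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its (high, low) UTF-16 surrogates.

    Hypothetical helper; assumes 0x10000 <= code_point <= 0x10FFFF.
    """
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low


# U+1F600 (grinning face) as a surrogate pair
print('%04X %04X' % to_surrogate_pair(0x1F600))  # prints "D83D DE00"
```

With that mapping, a block such as Emoticons (U+1F600 to U+1F64F) shares the high surrogate D83D and spans the low surrogates DE00 to DE4F, so on a narrow build it could be written as u'\ud83d[\ude00-\ude4f]'. Ranges that cross more than one high surrogate would have to be split into one such piece per high surrogate.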
