I'm trying to get Arabic tweets using the tweepy library in Python 3.6. With English it works perfectly, but when I try to get Arabic tweets I run into many problems. The problem with this latest code is that the Arabic characters in the tweets appear as "\u0635\u0648\u0651\u062a\u0648\u0627 ".

I tried several solutions from the internet, but none of them solved my problem: most of them extract only the "text" of the tweet, so they can fix the encoding on that text directly, whereas I want the whole tweet info as JSON.

    from tweepy.streaming import StreamListener
    from tweepy import OAuthHandler
    from tweepy import Stream
    import json


    access_token = '-'
    access_token_secret = '-'
    consumer_key = '-'
    consumer_secret = '-'


    class StdOutListener(StreamListener):

        def on_data(self, data):
            print (data.encode("UTF-8")) 
            return True


        def on_error(self, status):
            print (status)


    if __name__ == '__main__':

        l = StdOutListener()
        auth = OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_token_secret)
        stream = Stream(auth, l)

        stream.filter(track=["عربي"])


    $ python file.py > file2.txt

The results, both in the text file and in the terminal:

{"created_at":"Thu Jan 17 12:12:16 +0000 2019","id":1085872428432195585,"id_str":"1085872428432195585","text":"RT @MALHACHIMI: \u0642\u0627\u062f\u0629 \u062d\u0631\u0643\u0629 \u0627\u0644\u0646\u0647\u0636\u0629 \u0635\u0648\u0651\u062a\u0648\u0627 \u0636\u062f \u0627\u0639\u062a\....etc}

s99

1 Answer

If I do this with the first example in your question:

>>> print( "\u0635\u0648\u0651\u062a\u0648\u0627 ")
صوّتوا 

the Arabic appears. But if you display the whole structure without specifying how you want it displayed, you get a default representation restricted to the ASCII character set, in which anything not printable in that set is written as an escape. The reason for that default is practical: if you wanted to paste this string into a program, your IDE or editor might have trouble coping with the Arabic, because switching between the left-to-right order of the Python code and the right-to-left order of the string is very hard to manage. The information hasn't been lost or mangled, just displayed in a lowest-common-denominator format.
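
If you want the whole tweet with readable Arabic rather than just the text, one option is to parse the JSON and serialize it again with `ensure_ascii=False`. This is only a minimal sketch, assuming `data` is the JSON string that `on_data` receives; the helper name `to_readable_json` and the sample string `raw` are made up for illustration:

```python
import json

# Hypothetical sample of the raw JSON text a stream might deliver:
# every non-ASCII character is written as a \uXXXX escape.
raw = '{"id": 1, "text": "\\u0635\\u0648\\u0651\\u062a\\u0648\\u0627"}'


def to_readable_json(data):
    """Parse JSON text and re-serialize it without ASCII-only escaping."""
    tweet = json.loads(data)  # the escapes become real Arabic characters here
    # json.dumps escapes non-ASCII by default; ensure_ascii=False turns that off
    return json.dumps(tweet, ensure_ascii=False)


readable = to_readable_json(raw)
```

Inside `on_data` you could then call `print(to_readable_json(data))` instead of printing `data.encode("UTF-8")`. When redirecting to a file, the output encoding must also be able to represent Arabic; setting the `PYTHONIOENCODING=utf-8` environment variable is one way to arrange that.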

BoarGules