I am working with a Twitter streaming package for Python. I am currently searching for tweets that contain a keyword written in Unicode, and I use Python to save the matching tweets to a CSV file. However, I want the tweets to be converted back to Arabic characters when I save them in the CSV.

The errors I am receiving are all similar to "error ondata the ASCII characters in position ___ are not within the range of 128."

Here is my code:

class listener(StreamListener):
    def on_data(self, data):
        try:
            #print data

            tweet = (str((data.split(',"text":"')[1].split('","source')[0]))).encode('utf-8')
            now = datetime.now()
            tweetsymbols =  tweet.encode('utf-8')
            print tweetsymbols

            saveThis = str(now) + ':::' + tweetsymbols.decode('utf-8')
            saveFile = open('rawtwitterdata.csv','a')
            saveFile.write(saveThis)
            saveFile.write('\n')
            saveFile.close()
            return True
Joseph P Nardone

2 Answers

Excel requires a Unicode BOM character written to the beginning of a UTF-8 file to view it properly. Without it, Excel assumes "ANSI" encoding, which is OS locale-dependent.

This writes a 3-row, 3-column CSV file with Arabic:

#!python2
#coding:utf8
import io
with io.open('arabic.csv','w',encoding='utf-8-sig') as f:
    s = u'إعلان يونيو وبالرغم تم. المتحدة'
    s = u','.join([s,s,s]) + u'\n'
    f.write(s)
    f.write(s)
    f.write(s)

Output: [image: the resulting CSV file opened in Excel]

For your specific example, just make sure to write a BOM character, u'\ufeff', as the first character of your file, encoded in UTF-8. In the example above, the 'utf-8-sig' codec ensures a BOM is written.
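Because the code in the question opens the file in append mode on every tweet, the BOM (U+FEFF) should be written only once, when the file is first created. A minimal sketch of that idea follows; `append_line` is a hypothetical helper name, not part of any library, and under Python 2 the `text` argument is assumed to be a `unicode` object rather than a byte string.

```python
# -*- coding: utf-8 -*-
import io
import os


def append_line(path, text):
    """Append one line of unicode text; write a BOM only when the file is new or empty."""
    needs_bom = not os.path.exists(path) or os.path.getsize(path) == 0
    with io.open(path, 'a', encoding='utf-8') as f:
        if needs_bom:
            f.write(u'\ufeff')  # the utf-8 codec encodes this as the bytes EF BB BF
        f.write(text + u'\n')
```

Calling `append_line('rawtwitterdata.csv', saveThis)` in place of the open/write/close sequence would then leave exactly one BOM at the start of the file, however many tweets are appended.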

Also consult this answer, which shows how to wrap the csv module to support Unicode, or get the third party unicodecsv module.
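For comparison, under Python 3 (an assumption; the code elsewhere in this thread targets Python 2) the standard csv module accepts Unicode text directly, so no wrapper is needed. The sketch below uses a hypothetical helper name, `write_rows`, and lets `csv.writer` quote fields that contain commas, which a manual join would not do.

```python
import csv
import io


def write_rows(path, rows):
    """Write rows of text to a UTF-8 CSV that starts with a BOM so Excel detects the encoding."""
    # newline='' lets the csv module control line endings itself
    with io.open(path, 'w', encoding='utf-8-sig', newline='') as f:
        csv.writer(f).writerows(rows)
```

Reading the file back with the same `'utf-8-sig'` encoding strips the BOM again, so the first field of the first row comes back unchanged.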

Mark Tolonen

Here is a snippet that writes Arabic text to a file:

# coding=utf-8
import codecs
from datetime import datetime

class listener(object):


    def on_data(self, tweetsymbols):
        # python2
        # tweetsymbols is str
        # tweet = (str((data.split(',"text":"')[1].split('","source')[0]))).encode('utf-8')
        now = datetime.now()
        # work with unicode
        saveThis = unicode(now) + ':::' + tweetsymbols.decode('utf-8')
        # the with block closes the file even if a write fails
        with codecs.open('rawtwitterdata.csv', 'a', encoding='utf-8') as saveFile:
            saveFile.write(saveThis)
            saveFile.write(u'\n')
        return True


listener().on_data("إعلان يونيو وبالرغم تم. المتحدة")

Everything you need to know about encoding: https://pythonhosted.org/kitchen/unicode-frustrations.html

Ali SAID OMAR