
I am using Tweepy to stream tweets and would like to record them in CSV format so I can play around with them or load them into a database later. Please keep in mind that I am a noob, but I do realize there are multiple ways of handling this (suggestions are very welcome).

Long story short, I need to convert multiple Python dictionaries and append them to a CSV file. I have already done my research (How do I write a Python dictionary to a csv file?) and tried doing this with the csv module's DictWriter and writer.

However, there are a few more things that need to be accomplished:

1) Write the keys as a header only once.

2) As each new tweet is streamed, its values need to be appended without overwriting previous rows.

3) If a value is missing, record NULL.

4) Skip or fix ASCII codec errors.

Here is the format of what I would like to end up with (each value is in its individual cell):

Header1_Key_1 Header2_Key_2 Header3_Key_3...

Row1_Value_1 Row1_Value_2 Row1_Value_3...

Row2_Value_1 Row2_Value_2 Row2_Value_3...

Row3_Value_1 Row3_Value_2 Row3_Value_3...

Row4_Value_1 Row4_Value_2 Row4_Value_3...

Here is my code:

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import csv
import json

consumer_key="XXXX"
consumer_secret="XXXX"
access_token="XXXX"
access_token_secret="XXXX"

class StdOutListener(StreamListener):

    def on_data(self, data):
        json_data = json.loads(data)

        data_header = json_data.keys()
        data_row = json_data.values()

        try:
            with open('csv_tweet3.csv', 'wb') as f:
                w = csv.DictWriter(f, data_header)
                w.writeheader(data_header)
                w.writerow(json_data)
        except BaseException, e:
            print 'Something is wrong', str(e)

        return True

    def on_error(self, status):
        print status

if __name__ == '__main__':
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    stream = Stream(auth, l)
    stream.filter(track=['world cup'])

Thank you in advance!

verkter
  • Are you saying what you have isn't working? Also, you could always just dump the json to a file, one entry per line... – monkut Jun 11 '14 at 05:43
  • It is not working. Yes, you can, but I am trying to get it in CSV format. – verkter Jun 11 '14 at 05:55
  • When you say it's 'not working', is there anything specific that doesn't seem to be working? Is there an exception, for example? – monkut Jun 11 '14 at 06:36
  • I get an ASCII codec error when this is running, and I can't open the output CSV file with any editor. – verkter Jun 11 '14 at 16:51
  • This seems like a brute-force method, but why not keep tabs on the rows yourself? E.g. start with `headers = []` and `values = []`, then loop `for key, value in json.iteritems():` doing `headers.append(key)` and `values.append(value)`, and finally call `csv.writerow(headers)` and `csv.writerow(values)`. – AdriVelaz Jun 13 '14 at 04:45
  • But what happens when the second dictionary gets written? What can I do so the header (keys) gets written only once and the rows (values) get appended under the matching header? – verkter Jun 13 '14 at 04:51
  • Are you always getting back the same keys for each json? – AdriVelaz Jun 13 '14 at 04:59
  • @AdriVelaz Yes. This part just got answered here: http://stackoverflow.com/questions/24197840/writing-multiple-dictionaries-to-csv-with-one-header-with-python/24197913#24197913 What do you think? – verkter Jun 13 '14 at 05:04
  • That looks great. Totally go with that. – AdriVelaz Jun 13 '14 at 05:12

1 Answer

I have done a similar thing with Facebook's Graph API (the facepy module)!

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import csv
import json

consumer_key="XXXX"
consumer_secret="XXXX"
access_token="XXXX"
access_token_secret="XXXX"

class StdOutListener(StreamListener):
    _headers = None
    def __init__(self,headers,*args,**keys):
        StreamListener.__init__(self,*args,**keys)
        self._headers = headers

    def on_data(self, data):
        json_data = json.loads(data)

        #data_header = json_data.keys()
        #data_row = json_data.values()

        try:
            with open('csv_tweet3.csv', 'ab') as f: # a for append
                w = csv.writer(f)
                # write!
                # build a real list: Python 2's csv writer expects a sequence
                w.writerow([self._valToStr(json_data[header])
                            if header in json_data else ''
                            for header in self._headers])
        except Exception, e:
            print 'Something is wrong', str(e)

        return True

    @staticmethod
    def _valToStr(o):
        # json returns a set number of datatypes - convert each accordingly
        # https://docs.python.org/2/library/json.html#encoders-and-decoders
        if type(o)==unicode: return StdOutListener._removeNonASCII(o)
        elif type(o)==bool: return str(o)
        elif o is None: return ''
        # elif ...
        # ...

    @staticmethod
    def _removeNonASCII(s):
        # drop every character outside the 7-bit ASCII range
        return ''.join(i if ord(i)<128 else '' for i in s)

    def on_error(self, status):
        print status

if __name__ == '__main__':
    headers = ['look','at','twitter','api',
               'to','find','all','possible',
               'keys']

    # initialize csv file with header info
    with open('csv_tweet3.csv', 'wb') as f:
        w = csv.writer(f)
        w.writerow(headers)

    l = StdOutListener(headers)
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    stream = Stream(auth, l)
    stream.filter(track=['world cup'])

It's not copy-and-paste ready, but it should be clear enough for you to finish it.
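If you would rather stay with DictWriter, here is a minimal sketch of the "header only once" and "NULL for missing values" requirements. It assumes Python 3 (the code above is Python 2), and the function name `append_record` and its parameters are made up for illustration. Opening the file as UTF-8 text should also sidestep the ASCII codec errors, since nothing has to be stripped.

```python
import csv
import os


def append_record(path, fieldnames, record):
    """Append one dict as a CSV row; write the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings;
    # encoding="utf-8" keeps non-ASCII tweet text intact.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=fieldnames,
            restval="NULL",         # written for keys missing from the record
            extrasaction="ignore",  # drop keys that are not in fieldnames
        )
        if needs_header:
            writer.writeheader()
        writer.writerow(record)
```

Note that `restval` only covers keys that are absent; a key that is present with the value `None` is still written as an empty cell, so you would have to map those to "NULL" yourself.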
For performance, you may want to look into opening the file once, writing several records, and then closing it. This way you're not constantly opening the file, initializing the csv writer, appending, and closing the file again for every tweet. I'm not familiar with the tweepy API, so I'm not sure exactly how this would work - but it's worth looking into.
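A rough sketch of that idea, again assuming Python 3: `BatchedCsvWriter` is a hypothetical helper (not part of tweepy or the csv module) that keeps one file handle open and flushes to disk every `flush_every` rows. How you would hook it into the listener is left open, since that depends on the tweepy API.

```python
import csv


class BatchedCsvWriter:
    """Hypothetical helper: one open file handle, flushed every `flush_every` rows."""

    def __init__(self, path, headers, flush_every=100):
        self._file = open(path, "a", newline="", encoding="utf-8")
        self._writer = csv.writer(self._file)
        self._headers = headers
        self._flush_every = flush_every
        self._pending = 0

    def write(self, record):
        # Emit the values in header order, with '' for missing keys.
        self._writer.writerow([record.get(h, "") for h in self._headers])
        self._pending += 1
        if self._pending >= self._flush_every:
            self._file.flush()
            self._pending = 0

    def close(self):
        # Closing the file also writes out anything still buffered.
        self._file.close()
```

You would create one instance before starting the stream, call `write(json_data)` from `on_data`, and call `close()` when the stream ends. The trade-off is that up to `flush_every` rows can be lost if the process is killed before a flush.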

If you run into any trouble, I'll be happy to help - enjoy!

owns