
With the code shown below I fetch tweets from Twitter and store them initially in "backup.txt". I also create a file "tweets3.csv" and save some specific fields of each tweet. But I realized some tweets have exactly the same text (duplicates). How could I remove those from my csv file?

from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
import json
import csv


ckey = ''
csecret = ''
atoken = ''
asecret = ''

class listener(StreamListener):
    def on_data(self, data):
        try:
            all_data = json.loads(data)
            with open("backup.txt", 'a') as backup:
                backup.write(str(all_data) + "\n")

            text = str(all_data["text"]).encode("utf-8")
            id = str(all_data["id"]).encode("utf-8")
            timestamp = str(all_data["timestamp_ms"]).encode("utf-8")
            sn = str(all_data["user"]["screen_name"]).encode("utf-8")
            user_id = str(all_data["user"]["id"]).encode("utf-8")
            create = str(all_data["created_at"]).encode("utf-8")
            follower = str(all_data["user"]["followers_count"]).encode("utf-8")
            following = str(all_data["user"]["following"]).encode("utf-8")
            status = str(all_data["user"]["statuses_count"]).encode("utf-8")

        # text = data.split(',"text":"')[1].split('","source')[0]
        # name = data.split(',"screen_name":"')[1].split('","location')[0]
            contentlist = []
            contentlist.append(text)
            contentlist.append(id)
            contentlist.append(timestamp)
            contentlist.append(sn)
            contentlist.append(user_id)
            contentlist.append(create)
            contentlist.append(follower)
            contentlist.append(following)
            contentlist.append(status)
            print contentlist
            with open("tweets3.csv", 'ab') as f:
                wrt = csv.writer(f, dialect='excel')
                try:
                    wrt.writerow(contentlist)
                except UnicodeEncodeError:
                    return True
            return True
        except BaseException, e:
            print 'failed on data', type(e), str(e)
            time.sleep(3)

    def on_error(self, status):
        print "Error status:" + str(status)


auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=["zikavirus"], languages=['en'])
davdis
  • I think you could use a list variable, and every time you process a tweet, check whether its id is already in the list. If yes, do nothing; if no, add the id to the list. – Fusseldieb Oct 07 '16 at 14:09

1 Answer


I wrote this code, which builds a list; every time it processes a tweet, it checks that list. If the text doesn't exist in it yet, it adds it.

# Defines a list - it stores all unique tweet texts
tweetChecklist = []

# All your tweets. I represent them as a list to test the code
AllTweets = ["Hello", "HelloFoo", "HelloBar", "Hello", "hello", "Bye"]

# Goes over all "tweets"
for current_tweet in AllTweets:
    # If the tweet's text doesn't exist in the list yet
    if current_tweet not in tweetChecklist:
        tweetChecklist.append(current_tweet)
        # Do what you want with this tweet; it won't appear two times...

# Prints ["Hello", "HelloFoo", "HelloBar", "hello", "Bye"]
# Note that the second "Hello" doesn't show up - that's what you want.
# However, the check is case sensitive.
print(tweetChecklist)
# Clear the list
tweetChecklist = []

I think your code should look like this after implementing my solution:

from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
import json
import csv

# Define a list - it stores all unique tweet texts.
# Clear this list after you finish fetching all tweets.
tweetChecklist = []

ckey = ''
csecret = ''
atoken = ''
asecret = ''

class listener(StreamListener):
    def on_data(self, data):
        try:
            all_data = json.loads(data)
            with open("backup.txt", 'a') as backup:
                backup.write(str(all_data) + "\n")

            text = str(all_data["text"]).encode("utf-8")
            id = str(all_data["id"]).encode("utf-8")
            timestamp = str(all_data["timestamp_ms"]).encode("utf-8")
            sn = str(all_data["user"]["screen_name"]).encode("utf-8")
            user_id = str(all_data["user"]["id"]).encode("utf-8")
            create = str(all_data["created_at"]).encode("utf-8")
            follower = str(all_data["user"]["followers_count"]).encode("utf-8")
            following = str(all_data["user"]["following"]).encode("utf-8")
            status = str(all_data["user"]["statuses_count"]).encode("utf-8")

            # If the text does not exist in the list that stores all unique tweets
            if text not in tweetChecklist:
                # Store it, so that later tweets with the same text
                # never reach this code
                tweetChecklist.append(text)

                # Now, do your unique stuff
                contentlist = []
                contentlist.append(text)
                contentlist.append(id)
                contentlist.append(timestamp)
                contentlist.append(sn)
                contentlist.append(user_id)
                contentlist.append(create)
                contentlist.append(follower)
                contentlist.append(following)
                contentlist.append(status)
                print contentlist
                with open("tweets3.csv", 'ab') as f:
                    wrt = csv.writer(f, dialect='excel')
                    try:
                        wrt.writerow(contentlist)
                    except UnicodeEncodeError:
                        return True
            return True
        except BaseException, e:
            print 'failed on data', type(e), str(e)
            time.sleep(3)

    def on_error(self, status):
        print "Error status:" + str(status)


auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=["zikavirus"], languages=['en'])
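Note that the listener only prevents new duplicates from being written; rows already sitting in tweets3.csv stay duplicated. A one-off cleanup pass could look like the sketch below (written for Python 3; it assumes the tweet text is the first CSV column, as in contentlist above, and `dedupe_csv` is just an illustrative name):

```python
import csv

def dedupe_csv(path):
    """Rewrite a CSV file, keeping only the first row for each tweet text.

    Assumes the tweet text is the first column of every row.
    """
    seen = set()        # texts already kept
    unique_rows = []
    with open(path, "r", newline="") as src:
        for row in csv.reader(src, dialect="excel"):
            if row and row[0] not in seen:
                seen.add(row[0])
                unique_rows.append(row)
    with open(path, "w", newline="") as dst:
        csv.writer(dst, dialect="excel").writerows(unique_rows)
```

Run it once after the stream finishes, e.g. `dedupe_csv("tweets3.csv")`.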
Fusseldieb
  • Alternatively you could use a set for the tweets. As look-ups for sets are generally (a lot) faster than for lists, for a large file, that might be quite beneficial. – Nelewout Oct 07 '16 at 15:15
  • @N.Wouda Take a look: http://stackoverflow.com/questions/2831212/python-sets-vs-lists . [...] (sets) are slower than lists when it comes to iterating over their contents [...] - Remember: he doesn't want to check the tweet ids, but their _text / content_. – Fusseldieb Oct 07 '16 at 15:57
  • The set could consist of content only - I never mentioned IDs. Iterating is not necessary, as a simple existence check is sufficient - quoting your source, "Sets are significantly faster when it comes to determining if an object is present in the set". – Nelewout Oct 11 '16 at 18:20
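Picking up the suggestion from the comments: only membership tests are needed here, and those are O(1) on average for a set, so swapping the list for a set is a drop-in change. A minimal sketch of the same text-based check (`is_new` is an illustrative name):

```python
seen_texts = set()

def is_new(text):
    """Return True the first time a text is seen, False for repeats."""
    if text in seen_texts:
        return False
    seen_texts.add(text)
    return True

# Same toy data as in the answer; the check stays case sensitive
tweets = ["Hello", "HelloFoo", "Hello", "hello"]
unique = [t for t in tweets if is_new(t)]
print(unique)  # ['Hello', 'HelloFoo', 'hello']
```

Inside `on_data`, the `if text not in tweetChecklist:` test would become `if is_new(text):` with no other changes.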