
I'm developing an analysis program for Twitter data. I'm using MongoDB, and at the moment I'm trying to write a Java program that gets tweets from the Twitter API and puts them in the database. Getting the tweets already works very well, but I have a problem when I want to put them in the database. Because the Twitter API often returns the same tweets again, I need to place some kind of index in the database.

First of all, I connect to the database and get the collection related to the search term, or create this collection if it doesn't exist.

public void connectdb(String keyword) {
    try {
        // Initialize MongoDB and load the collection for this keyword
        initMongoDB();
        items = db.getCollection(keyword);
        BasicDBObject index = new BasicDBObject("tweet_ID", 1);
        items.ensureIndex(index);
    } catch (MongoException ex) {
        System.out.println("MongoException :" + ex.getMessage());
    }
}

Then I get the tweets and put them in the database:

public void getTweetByQuery(boolean loadRecords, String keyword) {

            if (cb != null) {
                TwitterFactory tf = new TwitterFactory(cb.build());
                Twitter twitter = tf.getInstance();
                try {
                    Query query = new Query(keyword);
                    query.setCount(50);
                    QueryResult result;
                    result = twitter.search(query);
                    System.out.println("Getting Tweets...");
                    List<Status> tweets = result.getTweets();

                    for (Status tweet : tweets) {

                        BasicDBObject basicObj = new BasicDBObject();
                        basicObj.put("user_name", tweet.getUser().getScreenName());
                        basicObj.put("retweet_count", tweet.getRetweetCount());
                        basicObj.put("tweet_followers_count", tweet.getUser().getFollowersCount());

                        UserMentionEntity[] mentioned = tweet.getUserMentionEntities();
                        basicObj.put("tweet_mentioned_count", mentioned.length);
                        basicObj.put("tweet_ID", tweet.getId());
                        basicObj.put("tweet_text", tweet.getText());


                        if (mentioned.length > 0) {
//                    System.out.println("Mentioned length " + mentioned.length + " Mentioned: " + mentioned[0].getName());
                        }
                        try {
                            items.insert(basicObj);
                        } catch (Exception e) {
                            System.out.println("MongoDB Connection Error : " + e.getMessage());
                            loadMenu();
                        }
                    }
                    // Printing fetched records from DB.
                    if (loadRecords) {
                        getTweetsRecords();
                    }

                } catch (TwitterException te) {
                    System.out.println("te.getErrorCode() " + te.getErrorCode());
                    System.out.println("te.getExceptionCode() " + te.getExceptionCode());
                    System.out.println("te.getStatusCode() " + te.getStatusCode());
                    if (te.getStatusCode() == 401) {
                        System.out.println("Twitter Error : \nAuthentication credentials (https://dev.twitter.com/pages/auth) were missing or incorrect.\nEnsure that you have set valid consumer key/secret, access token/secret, and the system clock is in sync.");
                    } else {
                        System.out.println("Twitter Error : " + te.getMessage());
                    }


                    loadMenu();
                }
            } else {
                System.out.println("MongoDB is not connected! Please check that a MongoDB instance is running.");
            }
        }

But as I mentioned before, the API often returns the same tweets, so they end up as duplicates in the database. I think the tweet_ID field is a good candidate for an index and should be unique in the collection.

fvrghl
JulianHi

2 Answers


Set the unique option on your index to have MongoDB enforce uniqueness:

items.ensureIndex(index, new BasicDBObject("unique", true));

Note that you'll need to manually drop the existing index and remove all duplicates or you won't be able to create the unique index.
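
The clean-up step described above could be sketched as follows. This is only an in-memory illustration, assuming each document is modelled as a `Map<String, Object>` rather than a driver `DBObject`; the names `TweetDeduplicator` and `keepFirstPerTweetId` are made up for this example and are not part of the MongoDB driver. It keeps the first document seen for each tweet_ID, which is the state the collection needs to be in before a unique index can be built.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: an in-memory model of the clean-up, not driver code.
final class TweetDeduplicator {

    // Returns the documents in their original order, keeping only the
    // first document seen for each tweet_ID value.
    static List<Map<String, Object>> keepFirstPerTweetId(
            List<Map<String, Object>> docs) {
        Set<Object> seenIds = new HashSet<>();
        List<Map<String, Object>> unique = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            // Set.add returns false when the ID is already present.
            if (seenIds.add(doc.get("tweet_ID"))) {
                unique.add(doc);
            }
        }
        return unique;
    }
}
```

Which duplicate to keep (first, last, or the one with the highest retweet_count) is a design choice; the sketch simply keeps the first.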

JohnnyHK

This question is already answered, but I would like to add that the MongoDB Java driver API 2.11 offers a method that takes the unique option as a parameter:

public void ensureIndex(DBObject keys, String name, boolean unique)

A small reminder for anyone who wants to store JSON documents in MongoDB: uniqueness must be applied to a BasicDBObject key, not to something inside a value. For example:

BasicDBObject basicObj = new BasicDBObject();
basicObj.put("user_name", tweet.getUser().getScreenName());
basicObj.put("retweet_count", tweet.getRetweetCount());
basicObj.put("tweet_ID", tweet.getId());
basicObj.put("tweet_text", tweet.getText());
basicObj.put("a_json_text", "{\"info_details\":{\"info_id\":\"1234\"},\"info_date\":{\"year\":\"2012\",\"month\":\"12\",\"day\":\"10\"}}");

In this case, you can create a unique index only on BasicDBObject keys:

BasicDBObject index = new BasicDBObject();
int directionOrder = 1;
index.put("tweet_ID", directionOrder);
boolean isUnique = true;
items.ensureIndex(index, "unique_tweet_ID", isUnique);

An index on something inside the JSON string, such as "info_id", would not work, since it's not a BasicDBObject key.
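
To make the distinction concrete, here is a small in-memory sketch, assuming documents are modelled as nested `Map<String, Object>` values instead of driver objects. The `DottedPathLookup` class and its `lookup` method are hypothetical names for this illustration only. It resolves a dotted path one key at a time: when the JSON is stored as a single string there is no key to descend into, while a nested document exposes the inner key.

```java
import java.util.Map;

// Hypothetical helper: models key lookup on plain maps, not driver code.
final class DottedPathLookup {

    // Walks the path key by key; returns null as soon as a segment is
    // missing or the current value is not a nested map (e.g. a String).
    static Object lookup(Map<String, Object> doc, String dottedPath) {
        Object current = doc;
        for (String key : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }
}
```

With "a_json_text" stored as a string, looking up `a_json_text.info_details.info_id` finds nothing; with the same data stored as nested maps, it finds "1234". MongoDB's dot notation addresses fields of embedded documents in a similar way, so storing the data as an embedded document instead of a string is what would make the inner field indexable.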

Using indexes in MongoDB is not as easy as it sounds. You may also want to check the MongoDB docs for more details: Mongo Indexing Tutorials and Mongo Index Concepts. Direction order becomes important to understand once you need a compound index, which is well explained in Why Direction order matter.
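
As a rough illustration of why direction matters in a compound index, the sketch below sorts in-memory rows the way an index on user_name ascending and retweet_count descending would order its entries. It is plain Java with hypothetical `TweetRow` and `CompoundOrderSketch` classes, not driver code, and the key spec `{user_name: 1, retweet_count: -1}` is an assumed example rather than one taken from the posts above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical row type holding just the two indexed fields.
final class TweetRow {
    final String userName;
    final int retweetCount;

    TweetRow(String userName, int retweetCount) {
        this.userName = userName;
        this.retweetCount = retweetCount;
    }
}

final class CompoundOrderSketch {

    // Mirrors an assumed key spec {user_name: 1, retweet_count: -1}:
    // user names ascending, and within one user the highest count first.
    static Comparator<TweetRow> compoundOrder() {
        Comparator<TweetRow> byUser =
                Comparator.comparing((TweetRow r) -> r.userName);
        Comparator<TweetRow> byCountDesc =
                Comparator.comparingInt((TweetRow r) -> r.retweetCount).reversed();
        return byUser.thenComparing(byCountDesc);
    }

    // Returns a sorted copy; the input list is left untouched.
    static List<TweetRow> sorted(List<TweetRow> rows) {
        List<TweetRow> copy = new ArrayList<>(rows);
        copy.sort(compoundOrder());
        return copy;
    }
}
```

Entries stored in this order can be read front to back or back to front, which is why such an index can serve a sort in exactly these directions or in both directions reversed, but not a sort that is ascending on both fields.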

Marcio Jasinski