
I'm working on a MongoDB project which stores tweets and was created by someone else. This person decided to use the Twitter tweet ID for the _id field in MongoDB, which means I now have no way to sort the tweets deterministically.

Example:

> db.tweets.find().sort({_id : 1}).limit(4)
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(1)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(2)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(3)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(5)}

The reason sorting on the _id field is non-deterministic is that, at a later date, my system could add the currently missing tweet with an ID of 4 to the database, meaning that the same command would return a different result set:

> db.tweets.find().sort({_id : 1}).limit(4)
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(1)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(2)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(3)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(4)}

My question is: is there a way to add a new field to every entry in the collection, with a value of type ObjectId, so that I can sort on that? If not, what would be the recommended way to 'rename' the _id field to, say, tweetId, and then make the _id field of type ObjectId?
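To make the problem above concrete, here is a minimal, database-free sketch in plain Python (hypothetical in-memory data, no MongoDB involved) of why a sort-plus-limit query over externally assigned IDs is non-deterministic:

```python
# Stand-in for the tweets collection, keyed by Twitter's tweet ID.
tweets = [{'_id': i} for i in (1, 2, 3, 5)]

def first_four(docs):
    """Equivalent of find().sort({_id: 1}).limit(4)."""
    return [d['_id'] for d in sorted(docs, key=lambda d: d['_id'])[:4]]

before = first_four(tweets)   # the "missing" tweet 4 is not yet stored
tweets.append({'_id': 4})     # tweet 4 is added later
after = first_four(tweets)    # the same query now returns a different set
```

The same query run before and after the late insert yields `[1, 2, 3, 5]` and then `[1, 2, 3, 4]`, which is exactly the behaviour described above.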

Thanks

Patrick

2 Answers


The only way to actually change the _id field would be to copy the document, change the _id, and delete the old document, as described in this answer:

How update the _id of one MongoDB Document?
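As a rough illustration of that copy/re-insert/delete sequence, here is a database-free sketch on a plain Python dict (`uuid4` stands in for a fresh ObjectId; against a real collection you would insert and remove on db.tweets instead of a list):

```python
import uuid

# Hypothetical in-memory stand-in for the collection, keyed by tweet ID.
tweets = [{'message': '...', '_id': 5}]

doc = tweets[0]
new_doc = dict(doc)                  # copy the document
new_doc['tweetId'] = new_doc['_id']  # preserve the old ID in a new field
new_doc['_id'] = uuid.uuid4()        # stand-in for a freshly generated ObjectId
tweets.append(new_doc)               # insert the copy
tweets.remove(doc)                   # delete the original
```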

To simply add a new field, passing an update function to a cursor's forEach function should work:

db.tweets.find().forEach( 
   function (tweet) {
      db.tweets.update({_id: tweet._id}, {$set: {newFieldName: tweet._id}});
   }
);
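The effect of that loop can be sketched database-free in Python on an in-memory list of dicts (`newFieldName` is the same placeholder used above):

```python
# In-memory stand-in for the tweets collection; no MongoDB involved.
tweets = [{'message': '...', '_id': i} for i in (1, 2, 3, 5)]

# Equivalent of update({_id: tweet._id}, {$set: {newFieldName: tweet._id}})
for tweet in tweets:
    tweet['newFieldName'] = tweet['_id']
```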
Shawn Bush

Some of the snippets in the post that Shawn linked to had flaws. While the idea was right, running them from the mongo shell caused several problems for me.

Getting a 'snapshot' of all the tweets before any new ones are added is difficult in mongo. The only way I could find to do it was to use:

> db.tweets.find({}, {_id : 1}).toArray()

or perhaps

> db.tweets.distinct('_id')

Unfortunately, as I had over 2 million tweets in my database, this caused mongo to run out of memory: I got an "exception: distinct too big, 16mb cap" error. Instead I used Python; here's the script:

#!/usr/bin/env python

"""A tool to work through all tweets, and convert the '_id'
from the Tweet ID into an ObjectID (saving the tweet)
ID in the 'tweetID' field
"""
import pymongo
from bson.objectid import ObjectId

if __name__ == "__main__":
    client = pymongo.MongoClient()
    db = client.guaiamum

    ids = list(t['_id'] for t in db.tweets.find({'_id': {'$type' : 18}}, {'_id' : 1}))
    for _id in ids:
        tweet = db.tweets.find_one({'_id' : _id})
        tweet['_id'] = ObjectId()
        tweet['twitterId'] = _id
        db.tweets.insert(tweet)
        db.tweets.remove({'_id': _id}, multi=False)

It still took a good 1.5 hrs to run, but bizarrely that's still much quicker than doing it in the mongo shell.

Patrick