
In my web scraping project, I need to move the previous day's scraped data from mongo_collection to mongo_his_collection.

I am using this query to move the data:

for record in collection.find():
    his_collection.insert(record)

collection.remove()

It works fine, but it sometimes breaks when the MongoDB collection contains more than 10k documents.

Can you suggest a more optimized query that will take fewer resources and do the same task?

Binit Singh
  • Thanks for the suggestion, but renaming would not help because I have to collect all previously scraped data in his_collection – Binit Singh Aug 30 '13 at 05:31
  • How about using the mongoexport and mongoimport tools? Export the whole collection and import it into some other collection. – dunn less Aug 30 '13 at 07:18
  • Why are you doing that? That's a lot of busy work for the database server. – WiredPrairie Aug 30 '13 at 11:06
  • @WiredPrairie To move all the previous day's scraped data into the history collection of MongoDB, so that after one month of scraping all the data is stored in the history collection and the data from the 30th day of scraping is stored in collection – Binit Singh Aug 30 '13 at 11:10
  • Is that going to perform better somehow? You might as well double-write the data and just flush the current collection every day. – WiredPrairie Aug 30 '13 at 11:14
  • Does this JavaScript work in the Mongo shell? – Kevin Meredith Nov 08 '13 at 01:55

5 Answers

2

You could use a MapReduce job for this.

MapReduce allows you to specify an out collection to store the results in.

If you have a map function that emits each document with its own _id as the key, and a reduce function that returns the first (and, because _ids are unique, only) entry of the values array, the MapReduce is essentially a copy operation from the source collection to the out collection.

Untested code:

db.runCommand({
    mapReduce: "mongo_collection",
    map: function() {
        // a map function takes no arguments; `this` is the current document
        emit(this._id, this);
    },
    reduce: function(key, values) {
        return values[0];
    },
    out: {
        merge: "mongo_his_collection"
    }
})
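Since the asker mentioned pymongo, the same command could be sent from Python along these lines. This is an untested sketch: the helper below only builds the command document (the map and reduce functions travel as JavaScript source strings and run server-side), and the database name in the usage comment is a placeholder.

```python
def copy_map_reduce_command(source, target):
    """Build a mapReduce command document that copies `source` into `target`.

    The map function emits each document under its own _id; the reduce
    function returns the single value for each key, so the output is a
    copy merged into the target collection.
    """
    return {
        "mapReduce": source,
        "map": "function() { emit(this._id, this); }",
        "reduce": "function(key, values) { return values[0]; }",
        "out": {"merge": target},
    }

# Untested usage sketch (requires pymongo and a running mongod):
#
#   from pymongo import MongoClient
#   db = MongoClient().your_database          # placeholder database name
#   db.command(copy_map_reduce_command("mongo_collection",
#                                      "mongo_his_collection"))
```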
Philipp
  • The only problem is that MR will change the document; you would need a cleanup function to return the document to its old structure – Sammaye Aug 30 '13 at 08:16
  • @Sammaye How would I perform a cleanup function? Could you please add a new answer with MR and a cleanup function? I am using pymongo – Binit Singh Aug 30 '13 at 08:46
  • @binit I don't know if it would be that good to perform the cleanup function afterwards; you would get the same problem as you do now... – Sammaye Aug 30 '13 at 08:48
  • @Sammaye In what way will MapReduce change the document in this example? – Philipp Aug 30 '13 at 08:52
  • Because it will of course factor off the document as a subdocument under the field `value`; this could be catastrophic for an application that relies on the exact structure – Sammaye Aug 30 '13 at 08:54
1

If both your collections are in the same database, I believe you're looking for renameCollection.
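From pymongo, renameCollection is issued as an admin command against fully qualified namespaces. A hedged sketch follows: only the namespace helper below is real code, the database name scraper_db is a placeholder, and the actual command in the usage comment is untested.

```python
def namespace(db_name, collection_name):
    # renameCollection takes fully qualified namespaces of the
    # form "database.collection" for both the source and the target.
    return "%s.%s" % (db_name, collection_name)

# Untested usage sketch (requires pymongo and a running mongod):
#
#   from pymongo import MongoClient
#   client = MongoClient()
#   client.admin.command(
#       "renameCollection",
#       namespace("scraper_db", "mongo_collection"),
#       to=namespace("scraper_db", "mongo_his_collection"),
#       dropTarget=True,  # overwrite the target collection if it exists
#   )
```

Note that dropTarget=True replaces the target rather than merging into it, which is why the asker said renaming alone would not accumulate history.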

If not, you unfortunately have to do it manually, using targeted mongodump / mongorestore commands:

mongodump -d your_database -c mongo_collection
mongorestore -d your_database -c mongo_his_collection dump/your_database/mongo_collection.bson

Note that I just typed these two commands off the top of my head without actually testing them, so do make sure you check them before running them in production.

[EDIT]: Sorry, I just realised that this is something you need to do on a regular basis. In that case, mongodump / mongorestore probably isn't the best solution. I don't see anything wrong with your solution - it would help if you edited your question to explain what you mean by "it breaks".

Nicolas Rinaudo
1

The query breaks because you are not limiting the find(). When you create a cursor, the server (mongod) will try to load the entire result set into memory, which will cause problems and/or fail if the collection is too large.

To avoid this, use a skip/limit loop. Here is an example in Java:

MongoClient client = new MongoClient();
DBCollection source = client.getDB("your_DB_name").getCollection("mongo_collection");
DBCollection target = client.getDB("your_DB_name").getCollection("mongo_his_collection");

long count = 0;
while (true) {
    DBCursor cursor = source.find()
            .sort(new BasicDBObject("$natural", 1))
            .skip((int) count)
            .limit(100);

    if (!cursor.hasNext()) {
        break; // no documents left to copy
    }

    while (cursor.hasNext()) {
        target.insert(cursor.next());
        count++;
    }
}

This will work, but you would get better performance by batching the writes as well. To do that, build an array of DBObjects from the cursor and write them all at once with a single insert.
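In pymongo, which the question uses, the batched copy could look roughly like this. The chunk size of 1000 is an arbitrary assumption, and insert_many is the modern API (the pymongo of the era took a list via collection.insert(batch)):

```python
def batched(docs, size):
    """Yield lists of at most `size` items from any iterable (e.g. a cursor)."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly short, batch


def copy_in_batches(source, target, size=1000):
    # `source` and `target` are assumed to be pymongo Collection objects;
    # each insert_many call is one round-trip instead of one per document.
    for batch in batched(source.find(), size):
        target.insert_many(batch)
```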

Also, if the collection is being altered while you are copying, there is no guarantee that you will traverse all documents, as some may be moved on disk if they grow in size.

Rick Houlihan
0

You can try mongodump & mongorestore.

metalfight - user868766
0

You can use renameCollection to do it directly. Or, if the collections are on different mongods, use cloneCollection.


Scott Stafford