
I'm attempting to work with the NYC 2013 taxi trip data set in MongoDB. It has about 170 million records across several CSV files, which I imported using mongoimport. The strings and numbers import as the correct types, but the pickup and drop-off timestamps are still strings. I know the usual way to fix this:
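(Roughly the following; the collection and field names, trips, pickup_datetime, and dropoff_datetime, are stand-ins for whatever mongoimport actually produced.)

    // The usual shell approach: find every document whose timestamp is still
    // a string (BSON type 2) and rewrite both fields as Date objects.
    // Assumes the string format is one that new Date() can parse.
    db.trips.find({ pickup_datetime: { $type: 2 } }).forEach(function (doc) {
        db.trips.update(
            { _id: doc._id },
            { $set: {
                pickup_datetime:  new Date(doc.pickup_datetime),
                dropoff_datetime: new Date(doc.dropoff_datetime)
            } }
        );
    });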

But that results in each of the 170M records being fetched from the database and the replacement dates being sent back, one document at a time. At the current rate, it looks like this will take at least 2 days to convert both fields in all the records. The database is housed on 4 shards, and those machines are barely doing anything during this process. Is there a faster way to do the conversion that uses more of the database's resources?

Tom Panning
  • You could run this process on each shard. Make sure to turn off the balancer, and run it in a shell that's local to the machine. You can even run multiple processes on each machine to subdivide and parallelize the work even further. – Asya Kamsky Nov 26 '14 at 16:15
  • @AsyaKamsky Just to be sure: if I turn off the balancer and run it locally, would that impact indexes or anything? I haven't indexed anything yet, but I want to understand any repercussions of doing this. – Tom Panning Nov 26 '14 at 16:17
  • It doesn't impact anything other than the documents being updated; I'm not sure exactly what you're concerned about. This field is not indexed? And the string should be large enough that replacing it with a date is an in-place update. So you're just rewriting each document in place, which means the more threads doing this in parallel, the faster it will finish. – Asya Kamsky Nov 26 '14 at 16:35

1 Answer


As suggested in the comments, once the balancer is turned off it is safe to run this kind of update directly on each shard. This approach is also discussed in this email thread. In my case, the conversion went from taking 2 days to 2 hours.
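For reference, here is a sketch of what a per-shard run might look like. The database, collection, and field names are illustrative, and the balancer is assumed to be stopped first (e.g. with sh.stopBalancer() from a mongos). Splitting the work by month makes it easy to run several shells per shard in parallel:

    // Run in a mongo shell local to one shard. One shell handles one month;
    // start several with different prefixes to parallelize within the shard.
    var prefix = "2013-01";                    // this process converts January
    var coll = db.getSiblingDB("taxi").trips;  // assumed db/collection names
    coll.find({ pickup_datetime: { $regex: "^" + prefix } }).forEach(function (doc) {
        coll.update(
            { _id: doc._id },
            { $set: {
                pickup_datetime:  new Date(doc.pickup_datetime),
                dropoff_datetime: new Date(doc.dropoff_datetime)
            } }
        );
    });

Once a field has been converted, it is a Date rather than a string and no longer matches the $regex filter, so an interrupted run can simply be restarted.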

Tom Panning