I have a system writing logs into MongoDB (about 1kk logs per day). On a weekly basis I need to calculate some statistics on those logs. Since the calculations are very processor- and memory-consuming, I want to copy the collection I'm working with to a powerful offsite machine. How do I keep the offsite collections up to date without copying everything? I modify the offsite collection by storing statistics within its documents, i.e., adding fields like {"algorithm_1": "passed"} or {"stat1": 3.1415}. Is replication right for my use case, or should I investigate other alternatives?
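For context, the kind of in-place modification described above can be sketched like this in Python, assuming a pymongo-style collection object; the function and field names are illustrative, not from any real schema:

```python
def tag_results(coll, results):
    """Attach computed statistics to log documents.

    coll is assumed to be a pymongo Collection (any object with an
    update_one method works); `results` maps each document _id to the
    stat fields to add, e.g. {"stat1": 3.1415} or {"algorithm_1": "passed"}.
    """
    for _id, fields in results.items():
        # $set adds the stat fields without touching the rest of the document
        coll.update_one({"_id": _id}, {"$set": fields})
```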

- What are 1kk logs per day? 1M log lines per day? – Stennie Feb 17 '14 at 02:04
- How are you going to deal with your log data? Map/Reduce, or do you have an application for the calculation? – yaoxing Feb 17 '14 at 03:05
- @Stennie, yes, exactly. – Moonwalker Feb 17 '14 at 11:55
- @yaoxing I have an application for the calculations already. – Moonwalker Feb 17 '14 at 12:03
- @Moonwalker Then solution 1 of my answer would be applicable. You need to read the documentation on the aggregation framework and the API docs for your language. This will put stress on your MongoDB server. If you go with solution 3 instead, the stress would be on your application server, and the logic would be more complex. – yaoxing Feb 18 '14 at 02:02
1 Answer
As to your question: yes, replication does partially resolve your issue, with limitations. Here are several ways I know of to resolve it:
1. The half-database, half-application way
Replication keeps your data up to date. However, it doesn't allow you to modify the secondary nodes (which hold what you call the "offsite collection"). So you have to do the calculation on the secondary and write the data to the primary. You need an application that runs the aggregation against the secondary and writes the result back to the primary.
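A minimal sketch of this read-from-secondary, write-to-primary flow, assuming pymongo-style collections (the `ts` and `level` field names are hypothetical; substitute your own log schema):

```python
def weekly_stats(logs, stats, since):
    """Aggregate on the secondary, write results back to the primary.

    `logs` is assumed to be a pymongo Collection opened against a secondary
    (e.g. with read_preference=ReadPreference.SECONDARY); `stats` is a
    collection on the primary. The "ts" and "level" fields are hypothetical.
    """
    pipeline = [
        {"$match": {"ts": {"$gte": since}}},                  # only this week's logs
        {"$group": {"_id": "$level", "count": {"$sum": 1}}},  # count per log level
    ]
    for row in logs.aggregate(pipeline):
        # upsert one summary document per log level on the primary
        stats.update_one({"_id": row["_id"]},
                         {"$set": {"count": row["count"]}},
                         upsert=True)
```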
This requires that you run an application (PHP, .NET, Python, whatever).
2. The full-server way
Since you are going to have multiple servers anyway, you can consider using sharding for faster storage and do the calculation directly online. That way you don't even need to run an application: Map/Reduce does the calculation and writes the output into a new collection. I DON'T recommend this solution, though, because of the Map/Reduce performance issues in current versions.
3. The full-application way
Basically you still use replication for reading, but the server doesn't do any calculation beyond querying data. You can use a capped collection or a TTL index to remove expired data, and you just enumerate the documents one by one in your application and do the calculation yourself.
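The TTL-plus-enumeration approach can be sketched as follows, again assuming a pymongo-style collection; `created_at` is a hypothetical timestamp field on each log document:

```python
def consume_logs(coll, handler, window_days=7):
    """Expire old data via a TTL index and stream the rest through the app.

    coll is assumed to be a pymongo Collection; "created_at" is a
    hypothetical timestamp field. `handler` is your own calculation,
    applied in the application, one document at a time.
    """
    # MongoDB deletes documents once created_at is older than the window
    coll.create_index("created_at",
                      expireAfterSeconds=window_days * 24 * 3600)
    processed = 0
    for doc in coll.find({}):  # enumerate documents one by one
        handler(doc)
        processed += 1
    return processed
```

Note that the TTL monitor removes expired documents in the background, so the window is approximate rather than exact.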