I have 30 GB of Twitter data stored in CouchDB. I want to process each tweet in Java, but my Java program cannot hold that much data in memory at once. To process the entire dataset, I am planning to split it into smaller databases using the filtered replication that CouchDB supports. However, as I am new to CouchDB, I am running into a lot of problems doing so. Any better ideas for doing this are welcome. Thanks.
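For concreteness, this is roughly the kind of filtered replication I mean; a minimal sketch assuming CouchDB runs on localhost:5984, the database is called `tweets`, and a hypothetical filter that keeps only tweets in one language (all names here are illustrative, not my actual setup):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FilteredReplication {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // A filter function stored in a (hypothetical) design document of the
        // source database; it keeps only tweets whose language matches the
        // query parameter passed to the replication.
        String designDoc = """
                {"_id": "_design/tweetfilter",
                 "filters": {
                   "by_lang": "function(doc, req) { return doc.lang === req.query.lang; }"
                 }}""";
        client.send(HttpRequest.newBuilder(URI.create("http://localhost:5984/tweets"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(designDoc)).build(),
                HttpResponse.BodyHandlers.ofString());

        // Trigger a one-off filtered replication into a smaller target database.
        String replication = """
                {"source": "tweets",
                 "target": "tweets_en",
                 "create_target": true,
                 "filter": "tweetfilter/by_lang",
                 "query_params": {"lang": "en"}}""";
        HttpResponse<String> response = client.send(
                HttpRequest.newBuilder(URI.create("http://localhost:5984/_replicate"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(replication)).build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```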
2 Answers
You can always query CouchDB for a result set that is small enough for your Java program to handle, so there should be no reason to replicate subsets into smaller databases. See this Stack Overflow answer for a way to get paged results from CouchDB. You might even do the processing inside CouchDB itself with map/reduce, but that depends on your problem.
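For example, here is a minimal paging sketch in plain Java (11+), assuming CouchDB on localhost:5984, a database named `tweets`, and Jackson for JSON parsing (the names are placeholders). It follows the usual CouchDB pattern of fetching `limit + 1` rows from `_all_docs` and using the extra row's id as the next `startkey`:

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TweetPager {
    private static final String COUCH = "http://localhost:5984";  // assumed server
    private static final String DB = "tweets";                    // assumed database name
    private static final int PAGE_SIZE = 500;

    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        ObjectMapper mapper = new ObjectMapper();

        String startKey = null;  // doc id at which the next page starts
        while (true) {
            // Ask for one extra row: it becomes the startkey of the next page.
            StringBuilder url = new StringBuilder(COUCH + "/" + DB
                    + "/_all_docs?include_docs=true&limit=" + (PAGE_SIZE + 1));
            if (startKey != null) {
                // _all_docs keys are JSON strings, so the id must be quoted.
                url.append("&startkey=")
                   .append(URLEncoder.encode("\"" + startKey + "\"", StandardCharsets.UTF_8));
            }

            HttpRequest request = HttpRequest.newBuilder(URI.create(url.toString())).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            JsonNode rows = mapper.readTree(response.body()).get("rows");
            int toProcess = Math.min(rows.size(), PAGE_SIZE);
            for (int i = 0; i < toProcess; i++) {
                processTweet(rows.get(i).get("doc"));  // your per-tweet logic goes here
            }

            if (rows.size() <= PAGE_SIZE) {
                break;  // no extra row returned: this was the last page
            }
            startKey = rows.get(PAGE_SIZE).get("id").asText();
        }
    }

    private static void processTweet(JsonNode tweet) {
        // Placeholder: replace with the real processing.
        System.out.println(tweet.get("_id").asText());
    }
}
```

Paging by `startkey` stays fast even deep into the database, whereas `skip`-based paging gets slower the further you go.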
Depending on the complexity of the queries and of the changes you make while processing your data set, you should be fine with a single instance.
As the previous poster said, you can use paged results; I tend to do something different:
- I have a document type for social likes. Each one refers to a URL, and I try to refresh it every 2-3 hours.
- I have a view that sorts the documents by the age of the last update request and of the last update.
- I query this view so that I exclude documents that had an update request within the last 30 minutes or were updated less than 2 hours ago (see the sketch after this list).
- I use RabbitMQ to enqueue the jobs, and if they are not picked up within 30 minutes, they expire.
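As an illustration, here is a minimal sketch of such a query, assuming a hypothetical view `by_last_update` that emits each document's last-update timestamp (ISO-8601 string) as its key; the database and view names are made up for the example. Each document returned would then be enqueued in RabbitMQ:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class StaleLikesQuery {
    public static void main(String[] args) throws Exception {
        // Assumed database and view names; the view is presumed to emit the
        // document's last-update timestamp as its key.
        String couch = "http://localhost:5984";
        String db = "social_likes";
        String view = "_design/likes/_view/by_last_update";

        // Only fetch documents whose last update is older than two hours.
        String cutoff = Instant.now().minus(2, ChronoUnit.HOURS).toString();
        String url = couch + "/" + db + "/" + view
                + "?include_docs=true&limit=100&endkey="
                + URLEncoder.encode("\"" + cutoff + "\"", StandardCharsets.UTF_8);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Each returned doc would then become a RabbitMQ job with a 30-minute TTL.
        System.out.println(response.body());
    }
}
```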

Hans