I have a CouchDB instance with over 1 million entries spread over a few databases. I need to draw random samples from them in a way that leaves me a record of the members of each sample. To that end, and following this question, I want to add a field holding a random number to every document in CouchDB.
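To make the goal concrete, here is a minimal sketch of how a stored random number gives a repeatable sample. It assumes each tweet is a plain dict carrying a `rand_num` in [0, 1); `draw_sample` and its `low`/`high` bounds are hypothetical names used for illustration, not part of couchdb-python.

```python
import random

def draw_sample(tweets, low, high):
    """Return the tweets whose stored rand_num falls in [low, high)."""
    return [t for t in tweets
            if 'rand_num' in t and low <= t['rand_num'] < high]

# Stand-in data: in practice these would be the tweets stored in CouchDB.
rng = random.Random(0)
tweets = [{'id': i, 'rand_num': rng.random()} for i in range(1000)]

# Roughly 10% of the tweets; the same call always returns the same members.
sample_a = draw_sample(tweets, 0.0, 0.1)
# A disjoint range gives a second sample with no members in common.
sample_b = draw_sample(tweets, 0.1, 0.2)
```

Because the random number is stored with the data, recording the pair of bounds is enough to recreate a sample later.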

Code to add a random number

import couchdb
from numpy.random import rand

def add_random_fields():
    server = couchdb.Server()
    # Skip CouchDB's internal databases, whose names start with '_'
    databases = [database for database in server if not database.startswith('_')]
    for database in databases:
        print(database)
        for document in server[database]:
            if 'results' in server[database][document].keys():
                for tweet in server[database][document]['results']:
                    if 'rand_num' not in tweet.keys():
                        tweet['rand_num'] = rand()
                        server[database].save(tweet)

This fails because I do not have enough RAM to hold a copy of all my CouchDB databases.

First attempt: load databases in chunks

Following this question.

from itertools import izip_longest  # Python 2; use itertools.zip_longest in Python 3

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

# ..Just showing relevant part of add_random_fields()

   #..
        chunk_size = 100
        # grouper takes the chunk size first, then the iterable
        for tweet in grouper(chunk_size, server[database][document]['results']):

If I were iterating over a large list in Python, I would write a generator expression. How can I do that in couchdb-python? Or is there a better way?

mac389
  • Would just writing a Couch view that returns documents with a random number be viable? Then just access that view, i.e., if you wanted a 1 in 10 sample, return a number between 1 and 10 along with a document id, which will automatically be sorted, and then use itertools.groupby in Python to retrieve the full document where needed? – Jon Clements Dec 15 '12 at 15:33
  • @JonClements I want to be able to recreate the same random sample. I thought associating a random number with each document in the database was the most direct way to do that. How would I recreate a specific sample, using your approach? – mac389 Dec 15 '12 at 15:42

1 Answer

Use a generator to avoid loading large lists into memory

I found code by Marcus Brinkmann that creates a generator over all the documents in a CouchDB database. Call that generator couchdb_pager.
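The linked code is not reproduced here. As a rough sketch of the idea (a reconstruction, not necessarily Brinkmann's exact code), such a pager can read the `_all_docs` view one page at a time and yield document ids. The `bulk` parameter and the extra-row trick for locating the next page are assumptions of this sketch; `limit`, `startkey`, and `startkey_docid` are ordinary CouchDB view options that couchdb-python passes through `db.view`.

```python
def couchdb_pager(db, view_name='_all_docs', bulk=5000):
    """Yield document ids from `db`, fetching at most bulk + 1 rows per request."""
    # Ask for one extra row; it marks where the next page starts.
    options = {'limit': bulk + 1}
    while True:
        rows = list(db.view(view_name, **options))
        for row in rows[:bulk]:
            yield row.id
        if len(rows) <= bulk:
            break  # short page: nothing left to fetch
        # Resume at the extra row, which has not been yielded yet.
        last = rows[bulk]
        options['startkey'] = last.key
        options['startkey_docid'] = last.id
```

Only one page of view rows is held in memory at a time, so memory use is bounded by `bulk` rather than by the size of the database.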

The original function becomes the following.

import couchdb
from numpy.random import rand

def add_random_fields():
    server = couchdb.Server()
    databases = [database for database in server if not database.startswith('_')]
    for database in databases:
        db = server[database]
        for doc_id in couchdb_pager(db):
            doc = db[doc_id]  # fetch each document only once
            changed = False
            for tweet in doc.get('results', []):
                if tweet and 'rand_num' not in tweet:
                    tweet['rand_num'] = rand()
                    changed = True
            if changed:
                print(doc_id)
                # Save the parent document; saving the bare tweet
                # would create a new document instead of updating this one
                db.save(doc)
mac389