
I am trying to split a large dictionary into n smaller dictionaries. Each dictionary entry contains a web address, and the point of splitting the dictionary is so that scraping these addresses can be spread over several computers.

The dictionary is in the form:

{
  u'25637293': 
    [u'Methyldopa',u'http://www.ncbi.nlm.nih.gov/pubmed/25637293', 43579], 
  u'25672666':
    [u'Furosemide', u'http://www.ncbi.nlm.nih.gov/pubmed/25672666', 40750]
}

with 13000 key/value pairs.

The last entry in each value is an index from 0 to 13000.

This is what I have tried (although I have probably overcomplicated things):

1) Create a list of 13000 values

2) Split this list into n parts

3) Ensure that every dictionary entry has one of these index values

4) Iterate through the list for this computer. If a number in the list equals the index stored in a dictionary entry, the web address can be extracted for scraping (this last part is not included in the code)

    smalldict={}
    #create a list from 0-13000 and split it into dictionaries of n number 
    def chunks(l, n):
        n = max(1, n)
        return [l[i:i + n] for i in range(0, len(l), n)]

    #here I am inserting the values for the number of computers and how many dictionaries the big dictionary needs to be divided into
    number = len(dictionary)
    #entry for the number of dictionaries to divide it into
    computers =4
    #this is the 'name' of the computer that is running the script
    compno = 1
    #-1 because of 0 indexing
    compm=compno-1
    listlength = number/computers
    divider= range(number)
    division = chunks(divider, listlength)

    for entry in dictionary:
        #get all of the values from the value
        value=dictionary[entry]
        #specify the smaller dictionary that will be created
        for i in division[compm]:
            #if the number up to 13000 is in the dictionary
            if i == value[2]:
                smalldict[value[1]]=value

I expected len(smalldict) to be about 13000/4, since len(dictionary) is 13000 (which is also len(division[0]) when the division contains only one list), but it is only a few hundred. It is not splitting as it is supposed to. I've been working on this for many days. Can anyone help?

Aminah Nuraini
user1745691
  • Sounds like you need a DB. I recommend you read about [SQLite](https://docs.python.org/2/library/sqlite3.html). Although it is not your only option, it's very simple to use with Python. – pzp Mar 15 '15 at 02:25
  • I'm a bit confused by the logic in the last section. Are you just trying to arbitrarily split the dictionary into equally-sized chunks or are there extra restrictions on which items go into which chunks? – Andrew Magee Mar 15 '15 at 02:26
  • I'm trying to split the dictionary into roughly equal chunks - there is no restriction on which items go into which chunks. – user1745691 Mar 15 '15 at 02:35
  • Discussed here: https://stackoverflow.com/questions/22878743/how-to-split-dictionary-into-multiple-dictionaries-fast – Boa Mar 15 '15 at 02:36

2 Answers

3

Easy. Do it this way. For example, suppose we have a dictionary with five keys and want to split it into two dictionaries of roughly equal size (Python 2 syntax):

>>> d = {'key1': 1, 'key2': 2, 'key3': 3, 'key4': 4, 'key5': 5}
>>> d1 = dict(d.items()[len(d)/2:])
>>> d2 = dict(d.items()[:len(d)/2])
>>> print d1
{'key1': 1, 'key5': 5, 'key4': 4}
>>> print d2
{'key3': 3, 'key2': 2}
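
The slicing above relies on Python 2, where d.items() returns a list. As a rough sketch of the same idea for Python 3 and for any number of parts, something like the following might work. The name split_evenly is made up for illustration and is not part of any library, and the printed result assumes Python 3.7+, where dictionaries keep insertion order.

```python
from itertools import islice

def split_evenly(d, n):
    """Split dict d into n dicts whose sizes differ by at most one."""
    n = max(1, n)
    # base items per part; the first `extra` parts receive one more item
    base, extra = divmod(len(d), n)
    items = iter(d.items())
    parts = []
    for i in range(n):
        size = base + (1 if i < extra else 0)
        # islice consumes the next `size` items from the shared iterator
        parts.append(dict(islice(items, size)))
    return parts

d = {'key1': 1, 'key2': 2, 'key3': 3, 'key4': 4, 'key5': 5}
parts = split_evenly(d, 2)
print(parts)  # [{'key1': 1, 'key2': 2, 'key3': 3}, {'key4': 4, 'key5': 5}]
```

Each part is a contiguous run of the original items, so with 13000 entries and n = 4 every part would hold 3250 entries.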
Aminah Nuraini
  • This did not work for me, but I liked this idea more than the selected answer. My modified version: d1 = {k:d[k] for k in list(d.keys())[:int(len(d)*0.5)]} – negfrequency May 07 '20 at 14:33
1

This gist worked for me: https://gist.github.com/miloir/2196917

I re-wrote it with itertools.cycle.

import itertools
def split_dict(x, chunks):      
    i = itertools.cycle(range(chunks))       
    split = [dict() for _ in range(chunks)]
    for k, v in x.items():
        split[next(i)][k] = v
    return split
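
As a usage sketch, here is how one computer might pick its own share with this function. The names computers and compno are taken from the question; the ten-entry dictionary and its 'drug%d' labels are made-up stand-ins for the real data, and the function is repeated so the snippet runs on its own (Python 3).

```python
import itertools

def split_dict(x, chunks):
    # Same round-robin split as above: item 0 goes to chunk 0,
    # item 1 to chunk 1, and so on, wrapping around.
    i = itertools.cycle(range(chunks))
    split = [dict() for _ in range(chunks)]
    for k, v in x.items():
        split[next(i)][k] = v
    return split

# Hypothetical stand-in for the 13000-entry dictionary in the question
dictionary = {
    str(n): ['drug%d' % n, 'http://www.ncbi.nlm.nih.gov/pubmed/%d' % n, n]
    for n in range(10)
}

computers = 4   # number of machines sharing the work
compno = 1      # 1-based "name" of the machine running this script

# Each machine keeps only its own chunk
smalldict = split_dict(dictionary, computers)[compno - 1]
urls = [value[1] for value in smalldict.values()]
print(len(smalldict))  # 3
```

Because the items are dealt out in turn, the chunk sizes differ by at most one, and no index stored inside the values is needed to decide which entry goes where.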
dwagnerkc