0

I need your help to automate branching a dictionary. I am iterating through lines of a big dataset with over 100 million lines. I split each line and select the parts of interest:

#I quickly wrote this to create a database so that you can test the script
fruits = ['apple','banana','citron']
cars = ['VW', 'Opel', 'Fiat']
countries = ['Bosnia','Egypt','USA','Ireland']
genomic_contexts = ['CDS', 'UTR5', 'UTR3', 'Intron']
database=[[fruits[random.randint(0,2)],cars[random.randint(0,2)],
countries[random.randint(0,3)],genomic_contexts[random.randint(0,3)]] 
for x in range(100)]

A_B_C_D_dict = {}
for line in database:
    line = line.split(',')
    A = line[0]
    B = line[1]
    C = line[2]
    D = line[3]
    #creating dict branch, if not existing yet and counting the combinations
    A_B_C_D_dict[A]=my_dict.get(A,{})
    A_B_C_D_dict[A][B]=my_dict[A].get(B,{})
    A_B_C_D_dict[A][B][C]=my_dict[A][B].get(C,{})
    A_B_C_D_dict[A][B][C][D]=my_dict[A][B][C].get(D,0)
    A_B_C_D_dict[A][B][C][D] += 1

Now I would like to define a function that does this automatically instead of always writing the branch manually (this should work for different branch lengths, not always 4)! My code should look like this:

for line in database:
    line = line.split(',')
    A = line[0]
    B = line[1]
    C = line[2]
    D = line[3]
    add_dict_branch('A_B_C_D_dict',0)
    A_B_C_D_dict[A][B][C][D] += 1

My try for such a function is the following but I might be humiliating myself:

def select_nest(dict_2,keys,last,counter=0):
    if last == 0:
        return dict_2
    if counter == last:
        return dict_2[globals()[keys[counter-1]]]
    else:
        return select_nested(
        dict_2[globals()[keys[counter-1]]],keys,last,counter+1)


def add_dict_branch(dict_1,end_type):
    if type(dict_1) != type(str()):
        raise KeyError(dict_1," should be string!")
    keys = dict_1.split('_')
    keys = keys[:len(keys)-1]

    for x in range(len(keys)):
        key = globals()[keys[x]]

        if x < len(keys)-1:
            select_nest(globals()[dict_1],keys,x)[key] = \
            select_nest(globals()[dict_1],keys,x).get(key,{})
        else:
            select_nest(globals()[dict_1],keys,x)[key] = \
            select_nest(globals()[dict_1],keys,x).get(key,end_type)

Any comments would be great, either pointing out my mistake or suggestions to new approaches. I really need to write my code with respect to performance. If it is slow I can't use it because of the several million iterations.

jonrsharpe
  • 115,751
  • 26
  • 228
  • 437
KaPy3141
  • 161
  • 14
  • 1
    *"pointing out my mistake"* - what mistake? Is that code not working? Give a [mcve]. – jonrsharpe Jan 11 '19 at 13:39
  • What is the structure of your `my_dict`? As for your last question about `get()` vs querying the dict keys and then inserting, the first result from googling can be seen here: https://stackoverflow.com/a/37968524/9462009 – Endyd Jan 11 '19 at 13:52
  • Before I suggest a solution with nested dicts, how frequently do you expect these keys to be repeated? Because if A is expected to be fairly unique, we're looking at millions of dictionaries at depth 1, each of which may have millions of dictionaries depending on how many unique Bs exist, and so on. What is the problem you're trying to solve? What does your data look like? – Reti43 Jan 11 '19 at 14:31
  • Thanks for the comments! The problems I am dealing with are mostly to plot for every [experimental setup] a graph showing the average [binding partners] of the [proteins] And every line would be a specific datapoint of 1 binding. So far a dictionary worked perfectly. dict[setup][protein][binding partners] Just I would like to have a function that creates the dictionaries for me so that I save time for future script writing. – KaPy3141 Jan 11 '19 at 14:55
  • @jonrsharpe: yes the add_dict_branch() function does not work. It only adds the first key (A) and then it breaks. I can work around that by just not using the function and writing the code to to add the dictionary branches manually. But I hate to do such repetitive things. I would rather spend time to write a function for that! – KaPy3141 Jan 11 '19 at 15:03
  • I'll just go ahead and assume the number of experimental setups and other variables, i.e., the number of unique keys for each depth, isn't some huge number. – Reti43 Jan 11 '19 at 15:04
  • @Reti43: Most keys range between 3-20 possibilities. Some are huge (thousands of possibilities) but normally I will have those as a set saved in the last key. (At the "end" of the branch) – KaPy3141 Jan 11 '19 at 15:08

2 Answers2

0

For a dictionary d = dict(), the following code

d.get(key, 0)
d[key] += 1

can be done in one line with a collections.defaultdict object.

d = defaultdict(int)
d[key] += 1

Effectively, if the key doesn't exist, it creates a key with a value based on the function passed (int() returns 0). If you want nested dictionaries where the last one is an integer counter, you'd want something like

d = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(int))))

This can be generalised in a function.

def generate_nested_dicts(depth):
    if depth < 1:
        raise ValueError('The structure must have depth >= 1 but got {}'.format(depth))
    s = 'defaultdict(int)'
    for i in range(depth - 1):
        s = 'defaultdict(lambda: {})'.format(s)
    return eval(s)

And to update a key

def update_key(d, keys):
    for key in keys[:-1]:
        d = d[key]
    d[keys[-1]] += 1

Then you can use it like this.

d = generate_nested_dictionaries(4)
for line in dataset:
    update_key(d, line)
Reti43
  • 9,656
  • 3
  • 28
  • 44
0

I manage to find the bug in the script and I also made it more efficient by directly handing the dictionary to the function instead of it having to repetitively search it through the globals() function. Here I present my working script to add dictionary branches in case someone finds this threat:

def select_nest(dict_2,keys,last,counter=0):

    if last == 0:
        return dict_2
    if counter == 0:
        counter += 1
    if counter == last:
        return dict_2[keys[counter-1]]
    else:
        return select_nest(dict_2[keys[counter-1]],keys,last,counter+1)

def add_dict_branch(dict_1,keys,end_type):

    for count,key in enumerate(keys):
        if count < len(keys)-1:
            if key not in select_nest(dict_1,keys,count).keys():
                select_nest(dict_1,keys,count)[key] = {}
        else:
            if key not in select_nest(dict_1,keys,count).keys():
                select_nest(dict_1,keys,count)[key] = end_type

And here an example how to use it:

add_dict_branch(transcript_dict1,[method,RBP,RNA,transcript],0)
transcript_dict1[method][RBP][RNA][transcript] += 1
KaPy3141
  • 161
  • 14