I need help automating the creation of branches in a nested dictionary. I am iterating over the lines of a big dataset with over 100 million lines. I split each line and select the parts of interest:
# I quickly wrote this to create a database so that you can test the script
import random

fruits = ['apple', 'banana', 'citron']
cars = ['VW', 'Opel', 'Fiat']
countries = ['Bosnia', 'Egypt', 'USA', 'Ireland']
genomic_contexts = ['CDS', 'UTR5', 'UTR3', 'Intron']
# each entry is one comma-separated line, like in the real dataset
database = [','.join([random.choice(fruits), random.choice(cars),
                      random.choice(countries), random.choice(genomic_contexts)])
            for x in range(100)]

A_B_C_D_dict = {}
for line in database:
    line = line.split(',')
    A = line[0]
    B = line[1]
    C = line[2]
    D = line[3]
    # creating dict branch, if not existing yet, and counting the combinations
    A_B_C_D_dict[A] = A_B_C_D_dict.get(A, {})
    A_B_C_D_dict[A][B] = A_B_C_D_dict[A].get(B, {})
    A_B_C_D_dict[A][B][C] = A_B_C_D_dict[A][B].get(C, {})
    A_B_C_D_dict[A][B][C][D] = A_B_C_D_dict[A][B][C].get(D, 0)
    A_B_C_D_dict[A][B][C][D] += 1
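As a side note on the manual version above, the get-then-assign pair on each level can probably be collapsed with dict.setdefault, which returns the existing child or inserts the default in one call. A minimal sketch, assuming a single hard-coded sample line (the names sample_line and counts are made up for illustration and are not part of my script):

```python
# Sketch only: sample_line and counts are illustrative names.
sample_line = 'apple,VW,USA,CDS'
counts = {}

for _ in range(3):  # process the same line three times
    A, B, C, D = sample_line.split(',')
    # setdefault returns the existing sub-dict, or inserts and returns a new one
    leaf = counts.setdefault(A, {}).setdefault(B, {}).setdefault(C, {})
    leaf[D] = leaf.get(D, 0) + 1

print(counts)  # {'apple': {'VW': {'USA': {'CDS': 3}}}}
```

This still hard-codes four levels, so it only shortens the manual branch and does not solve the variable-depth problem.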
Now I would like to define a function that does this automatically, instead of always writing out the branch manually (it should work for different branch lengths, not always 4). My code should look like this:
for line in database:
    line = line.split(',')
    A = line[0]
    B = line[1]
    C = line[2]
    D = line[3]
    add_dict_branch('A_B_C_D_dict', 0)
    A_B_C_D_dict[A][B][C][D] += 1
My attempt at such a function is below, but I might be humiliating myself:
def select_nest(dict_2, keys, last, counter=1):
    # return the sub-dict reached by following the first `last` keys;
    # counter starts at 1 because keys[counter-1] is the key looked up
    if last == 0:
        return dict_2
    if counter == last:
        return dict_2[globals()[keys[counter-1]]]
    else:
        return select_nest(
            dict_2[globals()[keys[counter-1]]], keys, last, counter+1)

def add_dict_branch(dict_1, end_type):
    if not isinstance(dict_1, str):
        raise TypeError(dict_1, " should be string!")
    keys = dict_1.split('_')
    keys = keys[:-1]  # drop the trailing 'dict'
    for x in range(len(keys)):
        key = globals()[keys[x]]
        if x < len(keys)-1:
            select_nest(globals()[dict_1], keys, x)[key] = \
                select_nest(globals()[dict_1], keys, x).get(key, {})
        else:
            select_nest(globals()[dict_1], keys, x)[key] = \
                select_nest(globals()[dict_1], keys, x).get(key, end_type)
Any comments would be great, either pointing out my mistake or suggesting new approaches. Performance really matters here: if the code is slow I can't use it, because of the several million iterations.
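To make the goal concrete, here is one possible direction, sketched under the assumption that the keys can be passed to the helper as a list instead of being looked up through globals(). It walks down the nested dict with setdefault and counts at the leaf, so the depth simply follows the number of keys. The names count_branch, nested_counts and sample_lines are hypothetical, and whether this is fast enough for 100 million lines would have to be measured on the real data.

```python
def count_branch(tree, keys):
    """Increment the counter stored under the nested path given by keys."""
    node = tree
    # create (or reuse) one sub-dict per key, except for the last key
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # the last key holds the count itself
    node[keys[-1]] = node.get(keys[-1], 0) + 1


# Hypothetical sample lines; the last one has only two fields
# to show that the depth is not fixed at 4.
sample_lines = [
    'apple,VW,USA,CDS',
    'apple,VW,USA,CDS',
    'banana,Fiat,Egypt,UTR5',
    'apple,Opel',
]

nested_counts = {}
for line in sample_lines:
    count_branch(nested_counts, line.split(','))

print(nested_counts['apple']['VW']['USA']['CDS'])  # 2
```

Passing the keys explicitly avoids both the globals() lookups and the repeated walks from the root that select_nest performs for every level.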