
I'm currently dealing with a dictionary that starts with a dozen items and grows to a dozen million items after a few iterations. Fundamentally, each item is defined by several IDs, a value, and some characteristics. I build my dict from JSON data I gather from a SQL server.

The operations I perform are, for example:

  • get SQL results in JSON
  • find items whose 'id1' and/or 'id2' are identical
  • merge all items that share the same 'id1' by summing float('value')

Here is an example of what my dict looks like:

[
   {'id1':'01234-01234-01234',
    'value':'10',
    'category':'K'}
...
   {'id1':'01234-01234-01234',
    'value':'5',
    'category':'K'}
...
]

What I would like to get looks like:

[
...
   {'id1':'01234-01234-01234',
    'value':'15',
    'category':'K'}
...
]
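A single pass with a plain dict as accumulator can produce this result in linear time. This is only a sketch: it assumes every item carries the 'id1', 'value' and 'category' keys, and that items sharing an 'id1' also share the same 'category'.

```python
def merge_by_id1(items):
    # Accumulate into a dict keyed by 'id1'; duplicates are summed
    # as they are encountered, so no second merge pass is needed.
    merged = {}
    for item in items:
        key = item['id1']
        if key in merged:
            merged[key]['value'] = str(float(merged[key]['value'])
                                       + float(item['value']))
        else:
            merged[key] = dict(item)  # copy so the input list is untouched
    return list(merged.values())

items = [
    {'id1': '01234-01234-01234', 'value': '10', 'category': 'K'},
    {'id1': '01234-01234-01234', 'value': '5', 'category': 'K'},
]
result = merge_by_id1(items)
# the two items collapse into one, with 'value' == '15.0'
```

Note that summing as floats and converting back turns '10' + '5' into '15.0', not '15'; if the exact string format matters, the conversion back to str should be done once at the end.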

I could use a dict of dicts instead (shown here conceptually, since a real dict cannot hold duplicate keys):

{
  '01234-01234-01234': {'value':'10',
                        'category':'K'}
...
  '01234-01234-01234': {'value':'5',
                        'category':'K'}
...
}

and get:

  {'01234-01234-01234': {'value':'15',
                        'category':'K'}
  ...
  }
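Because duplicate keys cannot exist in a real dict, the merge can happen while the container is being filled. A sketch, where the hypothetical 'rows' list stands in for the JSON records coming from the SQL server:

```python
# Made-up input rows of (id1, value, category); in practice these
# would come from the SQL/JSON results.
rows = [
    ('01234-01234-01234', '10', 'K'),
    ('01234-01234-01234', '5', 'K'),
]

# Accumulate straight into a dict of dicts, so duplicates never exist
# and no separate merge pass is needed afterwards.
merged = {}
for id1, value, category in rows:
    if id1 in merged:
        merged[id1]['value'] = str(float(merged[id1]['value']) + float(value))
    else:
        merged[id1] = {'value': value, 'category': category}
```

This also avoids ever holding both the unmerged and the merged data in memory at the same time, which matters at millions of items.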

I've got 4 GB of dedicated RAM and millions of dicts in one dictionary, on a 64-bit architecture, and I would like to optimise my code and my operations in both time and RAM. Are there tricks or better containers than a dictionary of dictionaries for this kind of operation? Is it better to create a new object that replaces the previous one at each iteration, or to mutate the existing object in place?

I'm using Python 3.4.

EDIT: simplified the question down to a single question about the value. The question is similar to How to sum dict elements or Fastest way to merge n-dictionaries and add values on 2.6, but in my case the values in my dicts are strings.

EDIT2: for the moment, the best performance I have obtained is with this method:

from copy import deepcopy

def merge_similar_dict(input_list):
    i = 0
    # Sort a copy of the list of exchange dicts by id,
    # so that duplicates end up adjacent.
    try:
        merge_list = sorted(deepcopy(input_list), key=lambda k: k['id'])
        while i + 1 < len(merge_list):
            while merge_list[i]['id'] == merge_list[i + 1]['id']:
                merge_list[i]['amount'] = str(float(merge_list[i]['amount'])
                                              + float(merge_list[i + 1]['amount']))
                del merge_list[i + 1]
                if i + 1 >= len(merge_list):
                    break
            i += 1
    except Exception as error:
        print('The merge of similar dicts has failed')
        print(error)
        raise
    return merge_list

Once there are tens of thousands of dicts in the list, it starts to take very long (several minutes).
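The repeated deletions from the middle of the list make that loop quadratic, which would explain the minutes-long runtimes. A sketch of the same merge done in one pass over the sorted list with itertools.groupby (assuming the same 'id' and 'amount' keys as above) stays roughly O(n log n):

```python
from itertools import groupby
from operator import itemgetter

def merge_similar_dict_fast(input_list):
    # Sort once by id, then collapse each run of equal ids in a
    # single pass instead of deleting items from the middle of the list.
    merged = []
    ordered = sorted(input_list, key=itemgetter('id'))
    for _, group in groupby(ordered, key=itemgetter('id')):
        first = dict(next(group))  # shallow copy keeps the input intact
        total = float(first['amount'])
        for item in group:
            total += float(item['amount'])
        first['amount'] = str(total)
        merged.append(first)
    return merged
```

This also avoids the deepcopy of the whole input, which is itself expensive in both time and RAM at this scale.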

    I don't think it's clear what you're trying to do? Can you be more specific? Do you have any code you can show? – Cyphase Aug 24 '15 at 22:48
  • From your question it's not clear at all what sort of dataset you're dealing with, or what you're trying to do with it. Is it a flat structure, or do the nodes reference one another? Is it a tree? Or some more complicated graph? It might be helpful if you explained, in real terms, what the data represent and what you want to compute from it. – ali_m Aug 25 '15 at 00:33
  • The outer container in your sample isn't a dictionary. It looks (in Python) like a set of dictionaries. There aren't any keys in outer layer. – hpaulj Aug 25 '15 at 01:30
  • In my case, my data describe materials, which are composed of chemical products, which are composed of molecules, which are composed of atoms. A chemical product can be composed of several simpler sub-products. I want to know the components of my products and the total amount of each component at different levels: chemical products, molecules, atoms. It looks like a tree or a directed graph, and there are sometimes loops. – Cyril Aug 25 '15 at 06:37
  • The data in the SQL look like: material_Id, component_Id, amount (value), linked_to_material_Id. – Cyril Aug 25 '15 at 06:54
  • I edited my question to make it simpler. – Cyril Aug 25 '15 at 15:33

0 Answers