I have a dictionary users with 1748 elements (only the first 12 are shown):

defaultdict(int,
            {'470520068': 1,
             '2176120173': 1,
             '145087572': 3,
             '23047147': 1,
             '526506000': 1,
             '326311693': 1,
             '851106379': 4,
             '161900469': 1,
             '3222966471': 1,
             '2562842034': 1,
             '18658617': 1,
             '73654065': 4,})  

and another dictionary partition with 452743 elements (only the first 42 are shown):

{'609232972': 4,
 '975151075': 4,
 '14247572': 4,
 '2987788788': 4,
 '3064695250': 2,
 '54097674': 3,
 '510333371': 0,
 '34150587': 4,
 '26170001': 0,
 '1339755391': 3,
 '419536996': 4,
 '2558131184': 2,
 '23068646': 6,
 '2781517567': 3,
 '701206260771905541': 4,
 '754263126': 4,
 '33799684': 0,
 '1625984816': 4,
 '4893416104': 3,
 '263520530': 3,
 '60625681': 4,
 '470528618': 3,
 '4512063372': 6,
 '933683112': 3,
 '402379005': 4,
 '1015823005': 2,
 '244673821': 0,
 '3279677882': 4,
 '16206240': 4,
 '3243924564': 6,
 '2438275574': 6,
 '205941266': 3,
 '330723222': 1,
 '3037002897': 0,
 '75454729': 0,
 '3033154947': 6,
 '67475302': 3,
 '922914019': 6,
 '2598199242': 6,
 '2382444216': 3,
 '1388012203': 4,
 '3950452641': 5,}

The keys in users (all unique) are all present in partition, where they are repeated with different values (partition also contains some extra keys that are of no use here). What I want is a new dictionary final that pairs each key of users with its matching values from partition. For example, if '145087572' is a key in users and the same key appears two or three times in partition with different values, as in {'145087572':2, '145087572':3,'145087572':7}, then I should get all three of these elements in final. I also have to store this dictionary as a key:value RDD.
Here's what I tried:

user_key=list(users.keys())  
final=[]
for x in user_key:
    s={x:partition.get(x) for x in partition}
    final.append(s)  

After running this code my laptop stops responding (the cell still shows [*]) and I have to restart it. Is there a problem with my code, and is there a more efficient way to do this?

  • `dict` cannot hold duplicated keys. Any duplicated key will be overwritten by the last entry. – Chris Jun 28 '19 at 08:42
  • Besides the dupe-keys problem, your code just creates a copy of the entire dict for each user, which with those dict sizes surely results in a memory error or your computer starting to swap. – tobias_k Jun 28 '19 at 08:45
  • Where is your `partitions` data coming from? If it's JSON, you could use [this](https://stackoverflow.com/q/29321677/1639625) – tobias_k Jun 28 '19 at 08:54
  • No, the partition data is a pickle file – Kriti Arora Jun 28 '19 at 08:56
  • And rather than storing the result in a dictionary, is there any other way to achieve what I want? – Kriti Arora Jun 28 '19 at 09:00
  • If it is a pickle file, i.e. a real serialized Python dictionary, then how can it contain duplicate keys? – tobias_k Jun 28 '19 at 09:58

1 Answer

First, a dictionary cannot hold duplicate keys: a duplicate key's value is overwritten by the last value given for that key.
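A minimal sketch of that overwriting behaviour, reusing the key and values from the example in the question (the name d is only for illustration):

```python
# A dict literal that repeats a key keeps only the last value for it.
d = {'145087572': 2, '145087572': 3, '145087572': 7}
print(len(d))          # 1
print(d['145087572'])  # 7
```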
Now let's analyze your code:

user_key = list(users.keys())  # all keys of users, say ['1', '2', '3']
final = []
for x in user_key:  # the outer loop is meant to visit each key of users
    # This line is the reason for the hang:
    s = {x: partition.get(x) for x in partition}
    # Broken down, the comprehension above is equivalent to:
    #     s = {}
    #     for x in partition:
    #         s[x] = partition.get(x)
    # The outer loop and the comprehension use the same name x. Inside the
    # comprehension, x refers to the keys of partition, so the key taken from
    # users is never used and s is a full copy of partition on every pass.
    final.append(s)

The reason for the hang: say users has 10 keys. The outer for loop then runs 10 times, and on each pass the comprehension iterates over every key of partition and builds a complete copy of it. Those copies exhaust memory, and eventually the system hangs.
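As a rough back-of-the-envelope check using the sizes quoted in the question (1748 keys in users, 452743 in partition), this is how many key/value entries the loop tries to keep in memory at once. The variable names are made up for illustration:

```python
n_users = 1748         # number of keys in users, from the question
n_partition = 452743   # number of keys in partition, from the question

# One full copy of partition is appended to final per key of users.
total_entries = n_users * n_partition
print(total_entries)  # 791394764
```

That is roughly 790 million dictionary entries spread across 1748 dicts.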
I think the following will work for you. Store the partition data in a Python defaultdict(list):

from collections import defaultdict

user_keys = users.keys()
part_dict = defaultdict(list)
# partition = [[key1, value], [key2, value], ....]
# Store your partition data this way (a list of [key, value] pairs), since a
# dict cannot keep repeated keys.
for key, value in partition:
    # defaultdict(list) creates the empty list on first access
    part_dict[key].append(value)
# part_dict = {key1: [1, 2, 3], key2: [1, 2, 3], key3: [4, 5], ....}
final = []
for x in user_keys:
    # .get avoids inserting empty lists for keys missing from part_dict
    for value in part_dict.get(x, []):
        final.append([x, value])
        # If you want the result as dictionaries (I don't think that's required), use
        # final.append({x: value})
        # final = [{key1: 1}, {key1: 2}, ....]
# final = [[key1, 1], [key1, 2], [key1, 3], .....]

The code above is untested; some minor changes may be required.
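For illustration, here is a small self-contained sketch of the same grouping idea on made-up data. The sample pairs, the name pairs, and the use of tuples instead of lists are assumptions for the example, not part of the original data:

```python
from collections import defaultdict

# Hypothetical sample data: users maps key -> count, partition is a list of
# (key, value) pairs so that repeated keys survive.
users = {'145087572': 3, '470520068': 1}
partition = [('145087572', 2), ('609232972', 4), ('145087572', 3),
             ('470520068', 1), ('145087572', 7)]

# Group every value of partition under its key in a single pass.
part_dict = defaultdict(list)
for key, value in partition:
    part_dict[key].append(value)

# Keep only the keys that occur in users, as flat (key, value) pairs.
pairs = [(key, value) for key in users for value in part_dict.get(key, [])]
print(pairs)
```

If a SparkContext is available (assumed here to be named sc), sc.parallelize(pairs) should give an RDD of key/value tuples, which is the form the question asks for.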

Achint