3

I have a list of 4-grams that I want to populate a dictionary object/shevle object with:

['I','go','to','work']
['I','go','there','often']
['it','is','nice','being']
['I','live','in','NY']
['I','go','to','work']

So that we have something like:

four_grams['I']['go']['to']['work']=1

and any newly encountered 4-gram is populated with its four keys, with the value 1, and its value is incremented if it is encountered again.

hmghaly
  • 1,411
  • 3
  • 29
  • 47
  • Duplicate of http://stackoverflow.com/questions/2600790/multiple-levels-of-collection-defaultdict-in-python – Maciej Gol Dec 26 '13 at 15:09
  • would this work with shelf object? – hmghaly Dec 26 '13 at 15:10
  • and it doesn't work on multiple levels, only two... this is useful of course but completely different thing, appreciate if you remove the duplicate tag – hmghaly Dec 26 '13 at 15:14
  • This can be easily implemented by subclassing the `Shelve` object's `__getitem__` to add a `defaultdict` object on `KeyError`. You can make it work with 4-long tuples either. – Maciej Gol Dec 26 '13 at 15:18
  • this sounds interesting, but how can I do this? – hmghaly Dec 26 '13 at 15:20
  • isn't this trivial to do? or am i missing the complexity? can't you just populate the dictionary using a 2D loop? – Enermis Dec 26 '13 at 15:23
  • trivial for you may be complex for others :) – hmghaly Dec 26 '13 at 15:25
  • i wasn't trying to be offensive. I was merely trying to understand the question. You want to populate an n-dimensional array using n-tuples? is that correct? – Enermis Dec 26 '13 at 15:27
  • basically yes, and want to be able to store it in a persistent form (shelve) as I'm populating it – hmghaly Dec 26 '13 at 15:28
  • and I wouldn't want to use writeback for shelve, so it doesn't seem straightforward as to how to create the keys and increment the values using this type of object – hmghaly Dec 26 '13 at 15:29
  • You can't avoid `writeback` in some form as you'll need to dump whole root-level key into the underlying file since the shelf is only aware of it's immediate keys, and thus you'll need to dump whole keys at once. – Maciej Gol Dec 26 '13 at 15:43
  • I can use temp variable to store the keys and their values, and this works nicely for single key shelves, but for more keys it is not, that is why I am asking – hmghaly Dec 26 '13 at 15:47
  • @hmghaly : Landed here via "shelve" .Wondering if you could add more on your performance with using both defaultdict & shelve, as per Robert's answer below. I am handling a similar usecase, with HUGE dictionary(that will fail obviously without shelve). Side note-: Statistically asking, why would you want to keep the order of the words ? gibberish, but should we not allow for "I work to go" , or is the "order" an intentional assumption ? – ekta Aug 30 '14 at 18:50

2 Answers2

1

You can just create a helper method that inserts the elements into a nested dictionary one at a time, each time checking to see if the desired sub-dictionary already exists or not:

dict = {}
def insert(fourgram):
    d = dict    # reference
    for el in fourgram[0:-1]:       # elements 1-3 if fourgram has 4 elements
        if el not in d: d[el] = {}  # create new, empty dict
        d = d[el]                   # move into next level dict

    if fourgram[-1] in d: d[fourgram[-1]] += 1  # increment existing, or...
    else: d[fourgram[-1]] = 1                   # ...create as 1 first time

You can populate it with your dataset like:

insert(['I','go','to','work'])
insert(['I','go','there','often'])
insert(['it','is','nice','being'])
insert(['I','live','in','NY'])
insert(['I','go','to','work'])

after which, you can index into dict as desired:

print( dict['I']['go']['to']['work'] );     # prints 2
print( dict['I']['go']['there']['often'] ); # prints 1
print( dict['it']['is']['nice']['being'] ); # prints 1
print( dict['I']['live']['in']['NY'] );     # prints 1
Myk Willis
  • 12,306
  • 4
  • 45
  • 62
  • Yes, you can just initialize `dict` as `dict = shelve.open('file', writeback=True)` instead, and it will work fine. – Myk Willis Dec 26 '13 at 15:49
  • yes, the problem with writeback=True is that we will run into memory problems if the dataset is large (which is the case here), so I want to avoid that – hmghaly Dec 26 '13 at 15:50
1

You could do something like this:

import shelve

from collections import defaultdict

db = shelve.open('/tmp/db')

grams = [
    ['I','go','to','work'],
    ['I','go','there','often'],
    ['it','is','nice','being'],
    ['I','live','in','NY'],
    ['I','go','to','work'],
]

for gram in grams:
    path = db.get(gram[0], defaultdict(int))

    def f(path, word):
        if not word in path:
            path[word] = defaultdict(int)
        return path[word]
    reduce(f, gram[1:-1], path)[gram[-1]] += 1

    db[gram[0]] = path

print db

db.close()
Robert Kajic
  • 8,689
  • 4
  • 44
  • 43
  • looks also good for a solution, but how can I use it to update a shelve object? – hmghaly Dec 26 '13 at 15:39
  • Does it need to be a Shelve? Could you perhaps pickle the dictionary / dump it to json, and then save it to a file yourself? – Robert Kajic Dec 26 '13 at 15:50
  • yes, as cannot keep opening and closing pickle files everytime I run this code (which I am using it for a very large dataset), or writing and reading from a file, that's why shelve (without writeback) is a very good solution, the point is how to make it work with updating multiple keys (which I think is possible with some use of temp variables but still cannot figure out how to do this exactly) – hmghaly Dec 26 '13 at 15:54
  • Okay, I've updated my answer. I hope it's enough to get you started. – Robert Kajic Dec 26 '13 at 16:07