
I am building a tagging model for natural language processing. Initially, the words of a sentence are tagged with a part of speech (like NN for noun), then rules are applied that group them into trees, which are represented as nested lists. This process iterates many times until there is one node at the top level. I keep a master list of all potential trees, and I need to eliminate duplicate trees or the whole thing blows up in memory. Here is a small sample of what the list consists of. I need to make sure that each list within the list is unique, as each iteration creates many branches.

[[('NP', [('PRP', 'I')]), ('VBD', 'ate'), ('DT', 'a'), ('NN', 'steak'), ('IN', 'with'), ('DT', 'a'), ('NN', 'knife'), ('.', '.')],
 [('PRP', 'I'), ('VP', [('VBD', 'ate')]), ('DT', 'a'), ('NN', 'steak'), ('IN', 'with'), ('DT', 'a'), ('NN', 'knife'), ('.', '.')],
 [('PRP', 'I'), ('VBD', 'ate'), ('NP', [('DT', 'a')]), ('NN', 'steak'), ('IN', 'with'), ('DT', 'a'), ('NN', 'knife'), ('.', '.')],
 ...]

I thought of using a set, but lists aren't hashable. I tried using recursion and it runs out of memory. I also thought about converting the lists to strings, using each string as a dictionary key and the list as its value, then iterating over the dictionary and turning it back into a list again (or keeping it as a dictionary?), roughly like the sketch below. Does anyone have a less hackish solution? I'm relatively new to Python, so please provide an explanation of how your solution works.
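
Something along these lines is what I have in mind (just a rough sketch; `trees` stands in for my master list of nested lists):

# the string-key idea: use repr() of each nested list as a dictionary key
unique = {}
for tree in trees:
    unique[repr(tree)] = tree     # later duplicates overwrite, leaving one copy
trees = list(unique.values())     # back to a plain list of nested lists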

I should clarify: the nested lists can be indefinitely deep. The tree structure is not known ahead of time but is built on the fly. I'm trying to build something like this - http://jos.oxfordjournals.org/content/25/4/345/F23.large.jpg - but in the form of a nested list.

GenericJam
    possible duplicate of [Python : How to remove duplicate lists in a list of list?](http://stackoverflow.com/questions/12198468/python-how-to-remove-duplicate-lists-in-a-list-of-list) – sdasdadas Oct 31 '13 at 21:42
  • if you wrap the tree data in a custom tree class can't you make the tree objects hashable? – Brad Allred Oct 31 '13 at 21:44

2 Answers


You can hash tuples. This might solve part of your problem (although I am a little bit confused about exactly what you need). It would let you store all existing tuples in a set (a collection of hashable items with fast membership tests). That way, when you create each new tuple you can just do:

if my_tuple in my_tuple_set:
    # an equal tuple is already stored, so drop this copy
    # (note: set.pop() takes no argument, so the stored object
    #  cannot be pulled back out of a set by value)
    pass
else:
    my_tuple_set.add(my_tuple)
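
If you also want to reuse the copy that is already stored (the "reset pointer to stored data" idea), a dict is a better fit than a set, because a dict can hand back the object it already holds. A rough sketch of that variation, with `tuple_cache` and `intern_tuple` as made-up names:

tuple_cache = {}   # maps each tuple to the first equal copy ever seen

def intern_tuple(t):
    # setdefault stores t only if no equal key exists yet,
    # and always returns the stored (possibly older) object
    return tuple_cache.setdefault(t, t)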

Kind of interesting having to deal with memory issues in Python!

vitiral

Thanks to sdasdadas for pointing out the duplicate. I was able to (more or less) solve it by creating this function:

def list_to_tuple(a_list):
    # recursively convert a nested list/tuple structure into nested tuples
    temp_list = []
    for item in a_list:
        if isinstance(item, (list, tuple)):
            temp_list.append(list_to_tuple(item))   # recurse into nested sequences
        else:
            temp_list.append(item)                  # leaves stay as they are
    return tuple(temp_list)

It takes an arbitrarily deep nesting of lists or tuples and gives back the equivalent structure made entirely of tuples. In another function, I pass the results through a set to ensure unique values, and from there they can go back to lists or whatever else.
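
For illustration, the deduplication step can then be as simple as something like this (a sketch only; `unique_trees` and `candidate_trees` are placeholder names):

def unique_trees(candidate_trees):
    # drop duplicate nested-list trees by hashing their tuple equivalents
    seen = set()
    unique = []
    for tree in candidate_trees:
        key = list_to_tuple(tree)   # hashable, so it can go in a set
        if key not in seen:
            seen.add(key)
            unique.append(tree)     # keep the original nested-list form
    return unique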

Thanks for your answers.

GenericJam
  • btw you can select your own answer as the `accepted answer`. This will let other users know that the question has been answered. – jramirez Oct 31 '13 at 22:24
  • @jramirez It doesn't let me accept the answer for 2 days. I don't mind seeing what other people come up with as well. My solution isn't necessarily the most elegant so I would be happy to learn something from someone more experienced. – GenericJam Oct 31 '13 at 22:44
  • No worries I just thought I'd let you know. – jramirez Oct 31 '13 at 22:46