
I'm writing an octree-type structure in Python, and currently all it can store is an ID. However, I'd like to change it so it can store any number of things, and I'm not sure which approach would be better, given this could hold thousands of values.

Would it be fine to store it as a normal list and hard-code the indexes as I add things, or would it be better to store it as a dictionary so it's easier to get the values you want?

  1. =[ID, stuff1, stuff2...]
  2. ={'ID': num, 'stuff1': something, 'stuff2': something}

With the first way, you get the ID with `voxel_info[0]`, but as more values are added it gets harder to follow, e.g. `owner = voxel_info[4]; importantText = voxel_info[5]`, where you have to remember which index relates to what at all points in the code. With the dictionary way, it'd be a lot easier to get the values as more things are added.

The dictionary way seems more usable, but it also seems like it'd slow the code down and use more memory, since it stores text keys for thousands of dictionaries. Which way would you recommend?

For an idea of how the data is stored, it looks like the structure below. Items are grouped if all of the block info is the same, which is where the different depths come from. The BLOCKINFO part is where the info is stored; that's the part I'm asking about being a list or a dictionary.

OctreeData = {Depth: 2
              Nodes: (1,1,1): {Depth: 1
                               Nodes: (1,1,1): BLOCKINFO
                                      (1,1,-1): {Depth: 0
                                                 Nodes: (1,1,1): BLOCKINFO
                                                        (1,1,-1): BLOCKINFO
                                                        (1,-1,-1): BLOCKINFO
                                                        etc}
                                      (1,-1,-1): BLOCKINFO
                                      (-1,-1,-1): BLOCKINFO
                                      (-1,-1,1): {Depth: 0
                                                  Nodes: etc}
                                      etc}
                     (1,1,-1): BLOCKINFO
                     etc}
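A minimal runnable version of the nested layout above, with BLOCKINFO as a placeholder dict (the keys, depths, and octant tuples here are illustrative, not the real data):

```python
# Hypothetical sketch: each internal node is a dict with 'Depth' and
# 'Nodes' keys; a leaf is just the block info itself.
BLOCKINFO = {'ID': 0}  # placeholder for whatever per-block data is stored

octree_data = {
    'Depth': 2,
    'Nodes': {
        (1, 1, 1): {
            'Depth': 1,
            'Nodes': {
                (1, 1, 1): BLOCKINFO,
                (1, 1, -1): {
                    'Depth': 0,
                    'Nodes': {
                        (1, 1, 1): BLOCKINFO,
                        (1, 1, -1): BLOCKINFO,
                    },
                },
            },
        },
        (1, 1, -1): BLOCKINFO,
    },
}

# Descend one level: child octant (1, 1, 1) of the root
child = octree_data['Nodes'][(1, 1, 1)]
print(child['Depth'])  # → 1
```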
Peter
    Use a class or a [`namedtuple`](https://docs.python.org/2/library/collections.html#collections.namedtuple). – Peter Wood Apr 10 '15 at 15:40
    _"but it also seems like it'd slow the code down and use more memory as it's storing text keys for thousands of dictionaries."_ Don't worry about that until you actually implement it and observe a performance problem. Anything sooner is premature optimization. Anyway, thousands of keys is like ten kilobytes, which isn't worth fussing over. – Kevin Apr 10 '15 at 15:41
  • with dictionary you have more control to your data. – Sara Santana Apr 10 '15 at 15:43
    There's just no one-size-fits-all answer to this question, but from a semantic POV using heterogeneous lists with hard-coded indexes is wrong - that's what tuples are for (also they are cheaper than lists), and you can use named constants for the indexes (ie `ID=0; STUFF1=1;` etc). But Kevin's comment is mostly spot on: that's still premature optimisation. – bruno desthuilliers Apr 10 '15 at 15:43
  • This question isn't really suitable for SO... a `dict` beats a `list` here, but a `dict` is really for arbitrary keys. Since your set of keys is fixed (and only change when you update the program), a `namedtuple` is more suitable. And `namedtuple`s are very efficient. – PM 2Ring Apr 10 '15 at 15:45
  • You don't have to worry about using lots of memory storing keys for each dictionary, you could just define all the keys some where and then use references to them (eg. `ID_KEY = "ID"`, and then `dictionary[ID_KEY] = "foo"`). In fact, depending on the circumstances Python [may do this itself](http://stackoverflow.com/questions/2123925/when-does-python-allocate-new-memory-for-identical-strings). As Kevin says though, don't optimize prematurely anyway. – stonesam92 Apr 10 '15 at 15:47
  • Alright thanks guys, if named tuples are efficient then that sounds like the best way. As to premature optimising, surely it's better to attempt it at the start as opposed to having to fix it later? :) – Peter Apr 10 '15 at 16:18
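A minimal sketch of the `namedtuple` approach suggested in the comments; the field names here are placeholders for whatever the block info actually holds:

```python
from collections import namedtuple

# A namedtuple gives readable field access without per-instance
# dict overhead; fields are fixed at class definition time.
BlockInfo = namedtuple('BlockInfo', ['ID', 'owner', 'important_text'])

voxel_info = BlockInfo(ID=42, owner='Peter', important_text='hello')

# Access by name instead of remembering magic indexes
print(voxel_info.ID)     # → 42
print(voxel_info.owner)  # → Peter

# Still a tuple underneath, so positional indexing works too
print(voxel_info[0])     # → 42
```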

2 Answers


It really depends on how big your list of items is and how you want to access them.

[[ID, stuff1, stuff2, stuff3], [ID, stuff1, stuff2, stuff3], ...]

will take O(n) to search through.

Likewise,

[{'ID': 'id', 'stuff': [1, 2, 3]}, {'ID': 'id', 'stuff': [1, 2, 3]}]

will take O(n) to search through.

{'ID1': ['stuff1', 'stuff2', 'stuff3'],
 'ID2': ['stuff1', 'stuff2', 'stuff3'],
 'ID3': ['stuff1', 'stuff2', 'stuff3']}

however, will only take O(1) to search (assuming you search on ID).

But really I would recommend using a database (plus an ORM if you are smart).
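The third layout can be sketched like this (the IDs and values are placeholders):

```python
# Keying the top-level dict by ID gives average-case O(1) lookup,
# instead of scanning a list of records.
blocks = {
    'ID1': ['stuff1', 'stuff2', 'stuff3'],
    'ID2': ['stuff1', 'stuff2', 'stuff3'],
}

# Direct lookup by ID, no scan needed
print(blocks['ID2'][0])  # → stuff1

# Membership tests are also O(1) on average
print('ID3' in blocks)   # → False
```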

Joran Beasley
  • Thanks, getting to those values isn't the hard part, as with it being an octree, the dictionary divides down recursively until it reaches the end point where a block is (which needs this information), so it's more about the best way of writing and storing all these end points :) – Peter Apr 10 '15 at 16:08
  • if search time is not an issue just use whatever you feel fits the model the best ... a dict, a list, a custom class, etc – Joran Beasley Apr 10 '15 at 16:20
  • Yeah, it was more wondering if dictionaries would severely slow down things, but people have suggested named tuples which sound like the best bet, since I don't need to change anything once it's written :) – Peter Apr 10 '15 at 17:37
  • namedtuples are a great solution ... but all things being equal the speed should be about the same for all 3 ... (that is, a dict might be marginally slower, but not enough to notice unless you are dealing with very large datasets) – Joran Beasley Apr 10 '15 at 18:23
  • Well, I worked them in and found a bit of a problem. Using 250,000 points and only storing the ID as a tuple, the efficiency of storing it (calculated from the length of the pickled input compared to the length of the pickled dictionary) is around 628%, or 4450% when compressed. However, using the named tuple, the efficiency drops to 256%, or 1667% when compressed, so there is quite a serious difference. I'm not sure if there's some flaw in cPickle to do with that, but that's likely the way I'll be saving the files so I guess I'll have to stick with plain tuples :) – Peter Apr 12 '15 at 12:30

There is nothing wrong with mixing both dicts and lists.

This can give you more elegant solutions than just using one structure.

my_data = [
    { 'ID' : 1, 'stuff' : ['stuff1', 'stuff2', 'stuff3']},
    { 'ID' : 2, 'stuff' : ['stuff1', 'stuff2', 'stuff3']},
    { 'ID' : 3, 'stuff' : ['stuff1', 'stuff2', 'stuff3']},
]

or...

my_data = {
    'ID1' : [ 'stuff1', 'stuff2', 'stuff3'],
    'ID2' : [ 'stuff1', 'stuff2', 'stuff3'],
    'ID3' : [ 'stuff1', 'stuff2', 'stuff3'],
}

Personally, I would go for option 1. Why? Each dictionary inside the list has the same structure. This makes it cleaner and easier to understand.

As for search, you can make use of `filter`:

from functools import partial

def custom_filter(data, filter_value=None):
    # Keep only the records whose ID matches
    return data['ID'] == filter_value

filter_on = 1
print(list(filter(partial(custom_filter, filter_value=filter_on), my_data)))
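The same search can be written as a list comprehension, which avoids `functools.partial` entirely (a sketch using the `my_data` layout above):

```python
# List-of-dicts layout from the answer above
my_data = [
    {'ID': 1, 'stuff': ['stuff1', 'stuff2', 'stuff3']},
    {'ID': 2, 'stuff': ['stuff1', 'stuff2', 'stuff3']},
]

# Equivalent of the filter() call: collect every record with a matching ID
matches = [d for d in my_data if d['ID'] == 1]
print(matches)  # → [{'ID': 1, 'stuff': ['stuff1', 'stuff2', 'stuff3']}]
```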
Matt Seymour
  • Not sure I understand your argument for option 1, both would provide a consistent structure, and option 2 allows you to access stuff for a specific ID in `O(1)` time, vs searching a list of IDs in `O(n)` – AChampion Apr 10 '15 at 16:05
  • Thanks, the stuff1, stuff2, etc was to just mean different values haha, with it being stored in an octree, it's all stored recursively, so the values are the end point, and all separate to each other :) I'll update the question haha – Peter Apr 10 '15 at 16:10