I have the following data structure:

One sample contains 5 vectors. Within each vector all the elements belong to the same class, but the class differs from one vector to the next. These vectors are really big, with thousands of elements each. I usually have several (5-10) samples.

At the moment I have a vector for every sample that contains the vectors of the classes, and I store the vectors of the samples in one outer vector so I can manage all the samples at once.

I use vectors because while filling my dataset I use .append(). Later on I won't change the data, just iterate through it and analyze it.

My problem is with memory: the dataset currently eats a lot of it, so some optimization would be great.

That's why I'm asking: is there a better way to store this dataset?

I've heard that an array is better if I don't change my data. Is it worth converting everything to arrays after loading it as vectors? What do you recommend?

For example, the dataset below is similar to mine:

class van:
    # some data
    pass

class bus:
    # some more data
    pass

class motorcycle:
    # something else
    pass

all_data = []
for _ in range(7):
    vans = [van() for _ in range(5000)]
    buses = [bus() for _ in range(2000)]
    mcycles = [motorcycle() for _ in range(3000)]
    dataset = [vans, buses, mcycles]
    all_data.append(dataset)
srikavineehari

2 Answers

If you want to keep your current code intact (minimizing work), you may consider replacing lists with lazylist: lazylist@github

internety

Considering that you need to keep the class structure, you can get a drastic improvement in memory consumption just by using __slots__. When a new object is created, only the attributes listed in __slots__ are allowed, and because instances no longer carry a per-instance __dict__, they use much less memory. Check out this question.
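As a minimal sketch of how this could look for the classes in the question (the attribute names length and capacity are made up for illustration):

class van:
    # restricts instances to these attributes and removes the per-instance __dict__
    __slots__ = ('length', 'capacity')

    def __init__(self, length=0.0, capacity=0):
        self.length = length
        self.capacity = capacity

vans = [van() for _ in range(5000)]  # same usage as before, just lighter instances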

Another approach would be to use a structured array from NumPy, but this depends on the exact nature of your data.
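For example, a minimal sketch assuming each record can be reduced to a few fixed numeric fields (the field names below are made up):

import numpy as np

# one contiguous block of memory for 5000 records instead of 5000 Python objects
van_dtype = np.dtype([('length', np.float32), ('capacity', np.int32)])
vans = np.zeros(5000, dtype=van_dtype)

vans['length'] = 4.5        # set a whole column at once
vans[0]['capacity'] = 3     # or access individual records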

tupui