
I need to load data from a CSV file or an Excel sheet (with rows and columns) into a two-dimensional Python dict. For example, if the data in an Excel sheet looks like this:

    name  age  gender location
1   Jim   18    male   China
2   Ross  18    male   China
3   Cara  19    female Japan
4   Ted   18    male   China

Then the output python dict should look like this:

data = {
  1: {'name': 'Jim', 'age': 18, 'gender': 'male', 'location': 'China'},
  2: {'name': 'Ross', 'age': 18, 'gender': 'male', 'location': 'China'},
  3: {'name': 'Cara', 'age': 19, 'gender': 'female', 'location': 'Japan'},
  4: {'name': 'Ted', 'age': 18, 'gender': 'male', 'location': 'China'}
}

You can see that there is a lot of duplicate information in this 2-D dict (my real data has the same problem), so I came up with the idea of a new dict with shared memory. To be specific, in the example above, I want my 2-D dict to store only one copy of {'age': 18, 'gender': 'male', 'location': 'China'} across multiple rows (these rows do not need to be adjacent). If we call data[1]['age'] and data[2]['age'], both lookups should land in the same extracted shared dict.

I have read the source code of the Python dict, and I know a Python dict only stores pointers to its keys and values (and for small ints and some strings, different pointers may point to the same object). So when I say I want to save only one copy, I mean one copy of the pointers.
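As a pure-Python illustration of what I mean, collections.ChainMap already gives this "own keys first, then shared dict" lookup. The sketch below hand-builds the shared part for the example above; it only shows the layout I am after, not a finished design:

from collections import ChainMap

shared = {'age': 18, 'gender': 'male', 'location': 'China'}  # stored once

data = {
  1: ChainMap({'name': 'Jim'}, shared),
  2: ChainMap({'name': 'Ross'}, shared),
  3: ChainMap({'name': 'Cara', 'age': 19, 'gender': 'female', 'location': 'Japan'}),
  4: ChainMap({'name': 'Ted'}, shared),
}

print(data[1]['age'], data[2]['age'])        # 18 18, both found in `shared`
print(data[1].maps[-1] is data[2].maps[-1])  # True: rows 1 and 2 share one dict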

Any ideas on how to design this dict? Thanks very much!

EDIT

Sorry, I forgot to mention: the data in this 2-D dict will be read-only.

He Li
  • This sounds like a very bad idea. If you change Jim’s location, suddenly Ross and Ted have changed as well. – taylor swift Jul 06 '16 at 02:43
  • @Kelvin Sorry, I forgot to mention. The data in this dict will be read-only :) – He Li Jul 06 '16 at 02:46
  • 3
  • Unless there is an intrinsic shared component between these people, it sounds like what you want is a [compression algorithm](https://en.wikipedia.org/wiki/Data_compression). Note that in principle, this will make accessing items in the dictionary slower—a space vs. speed tradeoff. – taylor swift Jul 06 '16 at 02:49
  • @Kelvin Yeah, that's true. My intention indeed is to compress data. My idea now is to add multiple pointers, pointing to extracted shared dicts, into the 2nd-level dict. If a lookup fails to find the key in its own ma_table, it will look for the key in the related shared dicts. That will increase the number of lookups, but maybe I will save a bunch of memory. – He Li Jul 06 '16 at 02:55
  • Concur with Kelvin here, but a way to get that behavior might be to subclass dict for your rows and override __getattr__ to look to some passed-in shared reference? – Adam Benzan Jul 06 '16 at 02:58
  • @AdamBenzan Thanks for your suggestion, Adam. But beforehand, I do not know whether a key resides in its own ma_table or an extracted shared dict, so I have to do the lookup in order. And I want to implement this dict in C/C++ as a python extension. – He Li Jul 06 '16 at 03:04
  • 1
  • If the data are short strings, they will be interned anyway, so the overhead of having the same strings in each dict will automatically be lowered. – BrenBarn Jul 06 '16 at 03:13
  • @brenbarn Thanks Bren. Unfortunately, my data are not always short strings. Some strings contain special characters and are pretty long, and Python won't intern them automatically. In addition, what I want to design is a more general 2-D dict; the data may contain tuples, lists, and even dicts. – He Li Jul 06 '16 at 03:19
  • @BrenBarn How short do they need to be to be interned automatically? I tested `s = [str(i%10) for i in range(100)]; print(len(set(map(id, s))))` and that printed `100`. That's one-character strings that apparently don't get interned. – Stefan Pochmann Jul 06 '16 at 03:21
  • @StefanPochmann: When I do that I get 10. (The behavior may have changed between Python versions.) However, I guess this only happens for string *literals* (see [this question](http://stackoverflow.com/questions/17679861/does-python-intern-strings)), although you can force it. – BrenBarn Jul 06 '16 at 03:32
  • @BrenBarn What Python version do you use? I use 3.5.1. And what do you get with `%20`? – Stefan Pochmann Jul 06 '16 at 03:41
  • It's hard to understand what you want because you're describing your "2-D dict" type as being both highly specialized and very general :). Honestly it sounds like you want a table-like (PyTables) or data-frame-like (pandas) object. I believe those packages come with built-in compression support. – Bi Rico Jul 06 '16 at 03:44
  • @BiRico Sorry for the confusion... Maybe you are right, what I really need is a table-like data structure. I will try PyTables to see whether it fits well. Thanks for your help :) – He Li Jul 06 '16 at 05:23
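EDIT 2

Following up on Adam Benzan's suggestion and my fallback-lookup idea in the comments, here is roughly what that could look like in pure Python (the class name SharedRow is just made up for the sketch; for item access the hook to override is __missing__ rather than __getattr__):

class SharedRow(dict):
    def __init__(self, own, shared):
        super().__init__(own)    # keys unique to this row; they shadow shared ones
        self._shared = shared    # one dict shared by many rows

    def __missing__(self, key):
        # called only when the row's own table misses, i.e. the second lookup
        return self._shared[key]

shared = {'age': 18, 'gender': 'male', 'location': 'China'}
data = {
  1: SharedRow({'name': 'Jim'}, shared),
  2: SharedRow({'name': 'Ross'}, shared),
  3: SharedRow({'name': 'Cara', 'age': 19, 'gender': 'female', 'location': 'Japan'}, shared),
  4: SharedRow({'name': 'Ted'}, shared),
}
print(data[1]['age'], data[2]['location'])   # 18 China, both resolved via the shared dict

A C extension could apply the same layout inside the hash table directly instead of paying for two Python-level dict lookups.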

1 Answer


I guess you are asking for a data-compression solution, which has to consider both the sizes of the objects and the references to them. The smallest memory footprint typically belongs to an integer, which should be no larger than a memory reference, so I would try to map everything to integers unless that is too inconvenient. Also, lists are smaller than dictionaries and allow direct, fast indexing.

Here is an alternative implementation that might spark some ideas:

import sys

data = {
  1: {'name': 'Jim', 'age': 18, 'gender': 'male', 'location': 'China'},
  2: {'name': 'Ross', 'age': 18, 'gender': 'male', 'location': 'China'},
  3: {'name': 'Cara', 'age': 19, 'gender': 'female', 'location': 'Japan'},
  4: {'name': 'Ted', 'age': 18, 'gender': 'male', 'location': 'China'}
}

In [43]: sys.getsizeof(data)
Out[43]: 280    # bytes

data_list = [
  ('Jim', 18, 0, 'CH'),     # 'CH' => 'China', 'JP' => 'Japan'
  ('Ross', 18, 0, 'CH'),    #  0 => male, 1 => female
  ('Cara', 19, 1, 'JP'),
  ('Ted', 18, 0, 'CH')
]


In [44]: sys.getsizeof(data_list)
Out[44]: 104   # bytes

_name, _age, _gender, _location = 0, 1, 2, 3

In [45]: data_list[2][_age]  # access as 2D array instead of 2-level dict
Out[45]: 19

The solution above will be a little slower to work with, but it yields real savings, especially for large strings. Sharing references probably won't save you much unless each record gets fairly long. Finally, if you replace the remaining values (the names and country codes) with integers as well, the Python-list version will compress quite a bit further.
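To illustrate that last step, here is a small hand-rolled sketch of dictionary-encoding every column into integer codes plus one decode table per column (the helper encode_column is made up for this example; pandas' categorical columns and PyTables do essentially this for you):

def encode_column(values):
    codes = {}                                             # value -> integer code
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    decode = list(codes)                                   # integer code -> value
    return encoded, decode

names     = ['Jim', 'Ross', 'Cara', 'Ted']
ages      = [18, 18, 19, 18]
genders   = ['male', 'male', 'female', 'male']
locations = ['China', 'China', 'Japan', 'China']

columns = [encode_column(col) for col in (names, ages, genders, locations)]
rows = list(zip(*(encoded for encoded, _ in columns)))     # row 0 == (0, 0, 0, 0)

# decode row 2 ('Cara') back to its original values
print([columns[i][1][code] for i, code in enumerate(rows[2])])   # ['Cara', 19, 'female', 'Japan']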

If you really want to get into choosing numeric codes that give the best compression, look into Huffman coding, e.g. this article: http://www.geeksforgeeks.org/greedy-algorithms-set-3-huffman-coding

clocker