
I have a big list of company information in an Excel spreadsheet. I need to bring the company info into my program to process it.

Each company has a unique label, which is used to access it. I can create a dictionary using the labels as keys and the company info as values, such as {label1: company1, label2: company2, ...}. But when the dictionary is created this way, it eats up too much memory.

Is it possible to create a generator that can be used like a dictionary?

Willem Van Onsem
zhuhuren
  • @JoshLee: the OP explicitly states he wants to avoid creating a dictionary... – Willem Van Onsem Mar 22 '17 at 15:13
  • Can you give more context about what your processing is about? Maybe you don't need a dictionary after all. – Moses Koledoye Mar 22 '17 at 15:14
  • You can create an object with a `__getitem__` method that looks stuff up on the fly when you call `mydata[...]`, if that's what you want. – khelwood Mar 22 '17 at 15:14
  • You need to define the problem you're actually trying to solve. Is it fast key-based access to this data? Sequential access? Merging of records with the same key? How big is the data? What are the memory constraints, and what does 'eats up too much memory' mean? – pvg Mar 22 '17 at 15:16
  • Would pandas [to_dict](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html) be of use to you at all? May be worth a shot, if I'm understanding the needs correctly! – Clusks Mar 22 '17 at 15:17
  • Another option, if you can't fit all the keys in memory, is to use an SQLite db or a quick pkl file, because you'll still need the data somewhere to look it up. But you could create a generator that iterates through the file and returns a small tuple for each line. – Keef Baker Mar 22 '17 at 15:17
  • Generators work by yielding values from an iterable one at a time. For that to happen, you need to have that data. If you need all the data loaded in memory so you can process it, then you will still need a data structure of some sort. If you need to process them one at a time, then you can just yield a row as you read the input document. – chaos Mar 22 '17 at 15:18
  • @chaos: in Python, generators can also take input. That is one of the surprising features of Python. – Willem Van Onsem Mar 22 '17 at 15:21
  • @WillemVanOnsem I don't think it's a surprising feature of Python, but it seems a little moot since the poster is (maybe?) asking about laziness and not really about generators. – pvg Mar 22 '17 at 15:23
  • @WillemVanOnsem That is not what I meant. What I meant is that a generator constructs an iterator, and for that it needs a data source; so it is about either loading the whole data in memory or yielding it one bit at a time. – chaos Mar 22 '17 at 15:24
  • The goal seems to be not having all of the data in memory. @KeefBaker is right: try using https://pypi.python.org/pypi/sqlitedict, which has a Python dictionary interface that masks an underlying SQLite DB. – eqzx Mar 22 '17 at 15:25
  • No, you can't use a generator here, as generators don't (and can't, because it doesn't make sense) implement any methods to provide _random access_ to their elements, as _they contain no elements_. – ForceBru Mar 22 '17 at 15:28
  • There is an unavoidable trade-off between memory and speed: a dictionary gives you (mostly) O(1) lookup but is held in memory; other, more memory-efficient approaches will be less rapid. – Chris_Rands Mar 22 '17 at 15:33

3 Answers


It seems the primary goal of the question is to have an object that behaves like a dictionary without holding the dictionary's contents in RAM (OP: "when the dictionary is created this way, it eats up too much memory"). One option is to use sqlitedict, which mimics the Python dictionary API and uses an SQLite database under the hood.

Here's the example from the current documentation:

>>> from sqlitedict import SqliteDict
>>> # using SqliteDict as a context manager works too (RECOMMENDED)
>>> with SqliteDict('./my_db.sqlite') as mydict:  # note: no autocommit=True
...     mydict['some_key'] = u"first value"
...     mydict['another_key'] = range(10)
...     mydict.commit()
...     mydict['some_key'] = u"new value"
...     # no explicit commit here
>>> with SqliteDict('./my_db.sqlite') as mydict:  # re-open the same DB
...     print(mydict['some_key'])  # outputs 'first value', not 'new value'
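
For the question's use case, the same on-disk dictionary could be filled once from the spreadsheet and then queried by label. Below is a minimal sketch, assuming the Excel data has been exported to a hypothetical companies.csv with the unique label in the first column:

import csv
from sqlitedict import SqliteDict

# One-time load: stream rows from disk into the on-disk dict,
# so the full data set never sits in RAM at once.
with SqliteDict('./companies.sqlite') as companies:
    with open('companies.csv', newline='') as f:
        for row in csv.reader(f):
            companies[row[0]] = row[1:]  # label -> rest of the record
    companies.commit()

# Later: look up a single company by label, dictionary-style.
with SqliteDict('./companies.sqlite') as companies:
    print(companies['label1'])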
eqzx

You could create a class that overrides the `__getitem__` method, like:

class Foo:
    def __getitem__(self, key):
        # ...
        # process the key
        # for example:
        return repr(key)

Now if you create a Foo:

>>> somefoo = Foo()
>>> somefoo['bar']
"'bar'"
>>> somefoo[3]
'3'

So syntactically it works "a bit" like a dictionary.
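
Applied to the question, `__getitem__` could scan the file on each lookup instead of holding anything in memory. Here is a minimal sketch, again assuming a hypothetical companies.csv export with the unique label in the first column:

import csv

class CompanyLookup:
    """Dict-like lookup that re-reads the file on every access,
    trading a linear scan per lookup for near-zero memory use."""

    def __init__(self, path):
        self.path = path

    def __getitem__(self, label):
        with open(self.path, newline='') as f:
            for row in csv.reader(f):
                if row and row[0] == label:  # label assumed in first column
                    return row[1:]
        raise KeyError(label)

With that, `companies = CompanyLookup('companies.csv')` supports `companies['label1']`, at the cost of one pass over the file per access.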

You can also use a generator with `send`, as demonstrated in this answer:

def bar():
    while True:
        key = yield
        # process the key
        # for example
        yield repr(key)

and call it with:

>>> somebar = bar()
>>> next(somebar)
>>> somebar.send('bar')
"'bar'"
>>> next(somebar)
>>> somebar.send(3)
'3'
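
Note the alternating `next(somebar)` calls: the generator has to be advanced to its bare `yield` before each `send`, so every lookup is a `next`/`send` pair.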
Willem Van Onsem

Assuming the problem you are facing is accessing key-value structured data from a CSV file, you have three options:

  1. Load the entire data set into a dictionary, copying it into RAM as a whole, and then have fast, constant-time access. This is what you said you want to avoid.
  2. Search through the data line by line every time you want to access it by key. This has no memory overhead, but it must scan the entire document on each access, giving linear access time.
  3. Use or copy the data into a database engine (or any key-value store) that supports disk-based indexing, allowing constant-time access without first loading the data into memory (see the sketch after this list).
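
A minimal sketch of option 3, using only the standard library's sqlite3 module; the file name companies.csv and its label-first column layout are assumptions for illustration:

import csv
import sqlite3

# One-time import into an indexed on-disk table.
conn = sqlite3.connect('companies.db')
conn.execute('CREATE TABLE IF NOT EXISTS companies (label TEXT PRIMARY KEY, info TEXT)')
with open('companies.csv', newline='') as f:
    rows = ((row[0], ','.join(row[1:])) for row in csv.reader(f))
    conn.executemany('INSERT OR REPLACE INTO companies VALUES (?, ?)', rows)
conn.commit()

# Lookups go through the primary-key index instead of RAM.
cur = conn.execute('SELECT info FROM companies WHERE label = ?', ('label1',))
print(cur.fetchone())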
Felk