
I am trying to generate a dataframe from a list of dictionaries. The list of dictionaries is generated via a list comprehension referencing the object.

import pandas as pd


class Foo:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    @property
    def rep(self):
        return {'a': self.a, 'b': self.b}


class Bar:
    def __init__(self):
        self.container = [Foo('1', '2'), Foo('2', '3'), Foo('3', '4')]

    def data(self):
        return [x.rep for x in self.container]


class Base:
    def __init__(self):
        self.all = {'A': [Bar(), Bar(), Bar()], 'B': [Bar(), Bar(), Bar()]}

    def test(self):
        list_of_reps = []
        [list_of_reps.extend(b.data()) for bar in [self.all[x] for x in self.all] for b in bar]
        pd.DataFrame(list_of_reps)


if __name__ == '__main__':
    b = Base()
    b.test()

I then use the Base class to combine all the dictionaries from the Foo objects. There can be several thousand of them, and as the list grows I see that both the conversion to a dataframe and the data() method in Bar become slow. Is there a more optimal way to generate this?

zeetitan
  • `list_of_reps = [x.data for x in "Base.bars"]` doesn't work; it returns `AttributeError: 'str' object has no attribute 'data'` – cs95 Dec 28 '20 at 01:21
  • Yes, sorry. I was trying to show an example of how I was creating the list of dicts. I can remove the formatting as code so it's clearer. – zeetitan Dec 28 '20 at 01:21

1 Answer


I am trying to generate a dataframe from a list of dictionaries.

In some sense, this is guaranteed to be slow, because Python objects are less efficient than a row in a pandas DataFrame. If you can avoid creating an object per row, that will save execution time.
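
For instance, if the underlying data is already available as plain columns, the DataFrame can be built with no per-row objects at all. This is a minimal sketch with made-up columnar data, not your actual classes:

import pandas as pd

# Hypothetical columnar data -- one list per column, no Foo/Bar objects created.
columns = {'a': ['1', '2', '3'], 'b': ['2', '3', '4']}
df = pd.DataFrame(columns)
print(df)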

[self.all[x] for x in self.all]

This is equivalent to self.all.values().

class Foo:

This can be replaced with a namedtuple, which is more memory-efficient. This also lets you avoid iterating in Bar.data().
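
As a small sketch of that point (separate from the full rewrite below): pandas builds a DataFrame directly from a sequence of namedtuples, taking the column names from the tuple fields, so no intermediate dicts are needed.

import pandas as pd
from collections import namedtuple

Foo = namedtuple("Foo", "a b")

rows = [Foo('1', '2'), Foo('2', '3'), Foo('3', '4')]
# Columns 'a' and 'b' are inferred from the namedtuple fields.
print(pd.DataFrame(rows))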

Wherever possible, I would try to use iterators instead of lists, for memory efficiency.

Here's how I would change this example:

import pandas as pd
from collections import namedtuple
import itertools

Foo = namedtuple("Foo", "a b")

class Bar:
    def __init__(self):
        self.container = [Foo('1', '2'), Foo('2', '3'), Foo('3', '4')]

    def data(self):
        return self.container


class Base:
    def __init__(self):
        self.all = {'A': [Bar(), Bar(), Bar()], 'B': [Bar(), Bar(), Bar()]}

    def test(self):
        # Flatten the dict of lists of Bar objects into one iterator of Bars.
        all_bars = itertools.chain.from_iterable(self.all.values())
        # Lazily pull each Bar's rows, then flatten into one stream of Foo tuples.
        reps_generator = (bar.data() for bar in all_bars)
        reps_flattened = itertools.chain.from_iterable(reps_generator)
        print(pd.DataFrame(reps_flattened))


if __name__ == '__main__':
    b = Base()
    b.test()
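
If you want to verify the speedup on your own data, here is a rough benchmarking sketch. The helper function and repetition count are my own invention; it assumes it runs in the same module as the rewrite above and that you substitute your real Base:

import itertools
import timeit

import pandas as pd

base = Base()  # the Base class from the rewrite above

def build_frame():
    # Same flattening as test(), but returning the frame instead of printing it.
    all_bars = itertools.chain.from_iterable(base.all.values())
    rows = itertools.chain.from_iterable(bar.data() for bar in all_bars)
    return pd.DataFrame(rows)

print(timeit.timeit(build_frame, number=1000))
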
Nick ODell
  • Thanks Nick. What I haven't shown here is the other list of attributes and methods that Foo / Bar contain, hence I left them as objects. – zeetitan Dec 28 '20 at 02:21
  • @zeetitan If you're using Python 3.6+, you could try using `typing.NamedTuple`, which lets you extend with new functionality: https://stackoverflow.com/a/44320510/530160 – Nick ODell Dec 28 '20 at 02:26
  • Thanks, I'll try it out! Will respond after testing it – zeetitan Dec 28 '20 at 02:29
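
For reference, a minimal sketch of the typing.NamedTuple approach mentioned in the comment above. The rep property is only an illustration of adding behaviour back onto the tuple, not part of the answer's code:

from typing import NamedTuple

class Foo(NamedTuple):
    a: str
    b: str

    # Ordinary methods and properties can live alongside the tuple fields.
    @property
    def rep(self):
        return {'a': self.a, 'b': self.b}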