Hello Stack Overflow folks, I hope this question has not already been answered. After half a day of googling, I resigned myself to asking here. My problem is the following:

I want to create a class which takes some information and processes this information:

# Class definition for one instance of raw data
class raw_data():   
    def __init__(self, filename_rawdata, filename_metadata,
                 file_format, path, category, df_raw, df_meta):
        self.filename_rawdata = filename_rawdata
        self.filename_metadata = filename_metadata
        self.file_format = file_format
        self.path = path
        self.category = category
        self.df_raw = getDF(self.filename_rawdata)
        self.df_meta = getDF(self.filename_metadata)

    # generator
    def parse(self, path):
        g = gzip.open(path, 'rb')
        for l in g:
            yield eval(l)

    # function that returns a pandas dataframe with the data
    def getDF(self, filename):
        i = 0
        df = {}
        for d in self.parse(filename):
            df[i] = d
            i += 1
        return pd.DataFrame.from_dict(df, orient='index')

Now I have a problem with the `__init__` method: I would like to run the `getDF` method below by default when the class is instantiated, but I cannot get this working. I have seen several other posts here, such as "Calling a class function inside of __init__" and "Python 3: Calling a class function inside of __init__", but I am still not able to do it. The first question did work for me, but I would like to access the instance variable after the constructor has run.

I tried this:

class raw_data():   
    def __init__(self, filename_rawdata, filename_metadata,
                 file_format, path, category):
        self.filename_rawdata = filename_rawdata
        self.filename_metadata = filename_metadata
        self.file_format = file_format
        self.path = path
        self.category = category
        getDF(self.filename_rawdata)
        getDF(self.filename_metadata)

    # generator
    def parse(self, path):
        g = gzip.open(path, 'rb')
        for l in g:
            yield eval(l)

    # function that returns a pandas dataframe with the data
    def getDF(self, filename):
        i = 0
        df = {}
        for d in self.parse(filename):
            df[i] = d
            i += 1
        return pd.DataFrame.from_dict(df, orient='index')

But I get an error because `getDF` is not defined (obviously). I hope this question is not silly. I need to do it this way because afterwards I want to create around 50-60 instances, and I do not want to repeat `Instance.getDF()` ... for every instance; I would rather have it called automatically.

  • `self.getDF(...)` – Phydeaux Feb 26 '19 at 15:24
  • As an aside, `raw_data.getDF` can be reduced to `return pd.DataFrame.from_dict(dict(enumerate(self.parse(filename))), orient='index')` or (I think) `return pd.DataFrame.from_records(self.parse(filename))`. – chepner Feb 26 '19 at 15:29
  • Also, neither `parse` nor `getDF` need to be methods of your class; they could be defined as regular functions outside the class, or, if you really want to keep them in your class's namespace, be made static methods. – chepner Feb 26 '19 at 15:32

1 Answer

All you need to do is call `getDF` like any other method, using `self` as the object on which it should be invoked.

self.df_raw = self.getDF(self.filename_rawdata)
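As a minimal, self-contained sketch of why the prefix matters (the `Greeter` and `Broken` classes are hypothetical toys, not part of the question's code): inside a method, a bare name is looked up in the local, global and builtin scopes, never on the instance, so a method has to be reached through `self`.

```python
class Greeter:
    """Hypothetical toy class; only the self.method() call pattern matters."""

    def __init__(self, name):
        self.name = name
        # The method is looked up on the instance via self.
        self.greeting = self.make_greeting(self.name)

    def make_greeting(self, name):
        return f"Hello, {name}!"


class Broken:
    """Same idea, but with the bare call used in the question."""

    def __init__(self, name):
        # No module-level function called make_greeting exists,
        # so this line raises NameError when an instance is created.
        self.greeting = make_greeting(name)

    def make_greeting(self, name):
        return f"Hello, {name}!"


g = Greeter("world")
print(g.greeting)

try:
    Broken("world")
    broken_error = None
except NameError as exc:
    broken_error = exc
print(type(broken_error).__name__)
```

The same applies to `getDF`: it is an attribute of the instance, so `__init__` has to call it as `self.getDF(...)`.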

That said, this class could be greatly simplified by making it a dataclass.

import gzip
from dataclasses import dataclass

import pandas as pd

@dataclass
class RawData:
    filename_rawdata: str
    filename_metadata: str
    path: str
    category: str

    def __post_init__(self):
        self.df_raw = self.getDF(self.filename_rawdata)
        self.df_meta = self.getDF(self.filename_metadata)

    @staticmethod
    def parse(path):
        with gzip.open(path, 'rb') as g:
            yield from map(eval, g)

    @staticmethod
    def getDF(filename):
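        # Number the rows and use those numbers as the index, as the original getDF did.
        return pd.DataFrame.from_dict(dict(enumerate(RawData.parse(filename))), orient='index')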

The auto-generated `__init__` method will set the four defined attributes for you. `__post_init__` is called at the end of that `__init__`, giving you the opportunity to call `getDF` on the two given file names.
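If the derived attributes should also appear in the class definition, they can be declared with `field(init=False)`. The following is only a sketch under several assumptions: the `Records` class, the sample file and its keys are made up for illustration, and `ast.literal_eval` is swapped in for `eval`, which only works if every line of the file is a plain Python literal. Rows are kept as a list of dicts so the example runs without pandas.

```python
import ast
import gzip
import os
import tempfile
from dataclasses import dataclass, field


def parse(path):
    """Yield one Python literal (e.g. a dict) per line of a gzipped text file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield ast.literal_eval(line.strip())


@dataclass
class Records:  # hypothetical stand-in for RawData
    filename: str
    category: str
    # Not an __init__ parameter; filled in by __post_init__.
    rows: list = field(init=False, repr=False)

    def __post_init__(self):
        self.rows = list(parse(self.filename))


# Build a tiny gzipped sample file (made-up data) to load.
tmp_dir = tempfile.mkdtemp()
sample = os.path.join(tmp_dir, "sample.json.gz")
with gzip.open(sample, "wt", encoding="utf-8") as fh:
    fh.write("{'asin': 'A1', 'overall': 5.0}\n")
    fh.write("{'asin': 'B2', 'overall': 3.0}\n")

records = Records(sample, "books")
print(records.rows)
```

With pandas installed, `pd.DataFrame(records.rows)` would turn the list of dicts into a DataFrame.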

chepner
  • Ah shit, I looked at the code countless times but somehow did not see that I was missing the `self.` before the function call :-( Sorry, this was stupid of me. I had never heard of `dataclasses` before; this was an eye-opener for me, thank you so much!! – BayerischerSchweitzer Feb 26 '19 at 16:17