I'm currently working on some heavy data analytics projects, and I'm trying to create a Python wrapper class to streamline the mundane preprocessing steps involved in cleaning data, partitioning it into test / validation sets, standardizing it, etc. The idea is ultimately to transform raw data into processed matrices that machine learning algorithms can easily consume for training and testing. Ideally, I'm working towards the point where I can write:
    data = DataModel(AbstractDataModel)
    processed_data = data.execute_pipeline(**kwargs)
So in many cases I'll start off with a self.df, which is a pandas dataframe object for my instance. But one method may be called standardize_data() and will ultimately return a standardized dataframe called self.std_df.
My IDE has been complaining heavily about me initializing variables outside of __init__. So to try to soothe PyCharm, I've been using the following code inside my constructor:
    from abc import ABC, abstractmethod

    class AbstractDataModel(ABC):
        @abstractmethod
        def __init__(self, input_path, ..., **kwargs):
            self.df_train, self.df_test, self.train_ID, self.test_ID, self.primary_key, ... (many more variables) = None, None, None, None, None, ...
Later on, these attributes are assigned their real values. I'll admit that I'm coming from heavy-duty Java Spring projects, so I'm still used to verbosely declaring variables. Is there a more Pythonic way of declaring my instance attributes here? I know I must be violating DRY with all the None values.
I've searched on SO and came across this similar question, but the answer provided there is more about setting instance variables through argv, so it isn't a direct solution in my context.
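For concreteness, here is a minimal sketch of one direction that might cut down the repeated None values: declaring the attributes once as dataclass fields with a default of None. This is only an illustration under assumptions, not necessarily the idiomatic answer. The class name DataModelSketch and the path "data.csv" are hypothetical, only the five attributes named above are listed, and Any stands in for a pandas dataframe so the snippet stays self-contained.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DataModelSketch:  # hypothetical name, not the real AbstractDataModel
    # The only required argument, mirroring input_path in the constructor above.
    input_path: str
    # Attributes that pipeline steps fill in later; each one defaults to None.
    # Any is a placeholder for a pandas dataframe (assumption, to avoid the import).
    df_train: Optional[Any] = None
    df_test: Optional[Any] = None
    train_ID: Optional[Any] = None
    test_ID: Optional[Any] = None
    primary_key: Optional[str] = None


# Every attribute exists right after construction, so nothing is
# defined for the first time outside the generated __init__.
model = DataModelSketch("data.csv")
print(model.df_train)  # None

# A later pipeline step can then assign the real value.
model.primary_key = "id"
```

With this shape each attribute is written once alongside its default, rather than once on the left of a tuple assignment and again as a matching None on the right.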