I am currently running several analysis steps on every column of my Pandas dataframe to get an overview of the data quality and structure (e.g. number of unique values, number of missing values, number of values per data type such as int/float/str).
My approach seems memory-intensive and inefficient, especially for larger data sets. I would really appreciate your thoughts on how to optimize the current process.
I iterate over all the columns in my dataset and, for each column separately, create two dictionaries (see below) that hold the relevant information. Since I am checking every row item anyway, would it be reasonable to combine all the checks in a single pass? If so, how would you approach this? Thank you very much in advance for your support.
data_column = input_dataframe.loc[:, "column_1"]  # for example, the first column of my dataframe
dictionary_column = {}
unique_values = data_column.nunique()
dictionary_column["unique_values"] = unique_values
na_values = data_column.isna().sum()
dictionary_column["na_values"] = na_values
zero_values = (data_column == 0).sum()  # summing a boolean mask counts the True entries
dictionary_column["zero_values"] = zero_values
positive_values = (data_column > 0).sum()
dictionary_column["positive_values"] = positive_values
negative_values = (data_column < 0).sum()
dictionary_column["negative_values"] = negative_values
data_column = data_column.dropna()  # drop NaN, otherwise missing elements would be counted as float
info_dtypes = data_column.apply(lambda x: type(x).__name__).value_counts()
dictionary_data_types = {} # holds the count of the different data types (e.g. int, float, datetime, str, ...)
for index, value in info_dtypes.items():  # Series.iteritems() was removed in pandas 2.0
    dictionary_data_types[str(index)] = int(value)
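
To make the question easier to reproduce, here is a minimal sketch of how the per-column steps above could be wrapped in one function and applied to every column. The names `profile_column` and `profile_dataframe` are hypothetical and not part of my original code. The sketch also assumes that coercing values with `pd.to_numeric(..., errors="coerce")` is acceptable for the sign checks. This keeps string values from raising a `TypeError`, but numeric-looking strings such as `"5"` are then counted as numbers.

```python
import pandas as pd


def profile_column(data_column: pd.Series) -> dict:
    """Collect the quality metrics from the snippet above for one column."""
    # Assumption: coerce to numbers so that non-numeric entries become NaN
    # instead of raising a TypeError in the comparisons below.
    numeric = pd.to_numeric(data_column, errors="coerce")
    # dropna() returns a new Series, so the original column is left untouched.
    non_null = data_column.dropna()
    type_counts = non_null.map(lambda x: type(x).__name__).value_counts()
    return {
        "unique_values": int(data_column.nunique()),
        "na_values": int(data_column.isna().sum()),
        "zero_values": int((numeric == 0).sum()),
        "positive_values": int((numeric > 0).sum()),
        "negative_values": int((numeric < 0).sum()),
        "data_types": {str(name): int(count) for name, count in type_counts.items()},
    }


def profile_dataframe(input_dataframe: pd.DataFrame) -> dict:
    """Return a dictionary mapping each column name to its metrics."""
    return {
        name: profile_column(input_dataframe[name])
        for name in input_dataframe.columns
    }


if __name__ == "__main__":
    # Small made-up example frame, only for illustration.
    example = pd.DataFrame({"a": [1, 0, -2, None], "b": ["x", "y", None, "x"]})
    for column_name, metrics in profile_dataframe(example).items():
        print(column_name, metrics)
```

This still scans each column several times, so it only tidies up the structure. It does not yet answer whether the checks can be merged into a single pass.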