
I am currently running several analysis steps on every column of my pandas DataFrame to get an overview of the data quality and structure (e.g. number of unique values, number of missing values, number of values per data type such as int/float/str).

My approach appears rather memory-intensive and inefficient, especially for larger data sets. I would really appreciate your thoughts on how to optimize the current process.

I iterate over all the columns in my dataset and, for each column separately, create two dictionaries (see below) that hold the relevant information. Since I am checking every row item anyway, would it be reasonable to combine all the checks into a single pass? And if so, how would you approach it? Thank you very much in advance for your support.

data_column = input_dataframe.loc[:,"column_1"]  # as example, first column of my dataframe

dictionary_column = {}

unique_values = data_column.nunique()
dictionary_column["unique_values"] = unique_values

na_values = data_column.isna().sum()
dictionary_column["na_values"] = na_values

zero_values = (data_column == 0).astype(int).sum()
dictionary_column["zero_values"] = zero_values

positive_values = (data_column > 0).astype(int).sum()
dictionary_column["positive_values"] = positive_values

negative_values = (data_column < 0).astype(int).sum()
dictionary_column["negative_values"] = negative_values

data_column = data_column.dropna()  # drop NaN, otherwise missing elements will be counted as float
info_dtypes = data_column.apply(lambda x: type(x).__name__).value_counts()

dictionary_data_types = {}  # holds the count of the different data types (e.g. int, float, datetime, str, ...)
for index, value in info_dtypes.items():  # Series.iteritems() was removed in pandas 2.0
    dictionary_data_types[str(index)] = int(value)
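
To make the idea of doing everything "in one go" concrete, here is a minimal sketch of how the per-column checks above could be wrapped in a single function and applied to every column. The names `profile_column` and `profile_dataframe` are hypothetical, not part of pandas. The sketch also assumes that it is acceptable to coerce values with `pd.to_numeric(errors="coerce")` before the sign checks, so non-numeric entries are ignored there and numeric-looking strings such as `"5"` would be counted as numbers. It still uses several vectorized passes per column rather than a true single pass.

```python
import pandas as pd


def profile_column(column: pd.Series) -> dict:
    """Collect the quality checks for one column (hypothetical helper)."""
    non_null = column.dropna()  # computed once and reused below
    # Assumption: non-numeric entries become NaN and drop out of the sign counts.
    numeric = pd.to_numeric(non_null, errors="coerce")
    type_counts = non_null.map(lambda x: type(x).__name__).value_counts()
    return {
        "unique_values": int(non_null.nunique()),
        "na_values": int(column.isna().sum()),
        "zero_values": int((numeric == 0).sum()),
        "positive_values": int((numeric > 0).sum()),
        "negative_values": int((numeric < 0).sum()),
        "data_types": {str(name): int(count) for name, count in type_counts.items()},
    }


def profile_dataframe(df: pd.DataFrame) -> dict:
    """Map each column name to its profile dictionary (hypothetical helper)."""
    return {name: profile_column(df[name]) for name in df.columns}


if __name__ == "__main__":
    example = pd.DataFrame({"a": [1, 0, -2, None, 3], "b": ["x", "y", None, "x", 5]})
    for name, profile in profile_dataframe(example).items():
        print(name, profile)
```

Unlike `dropna(inplace=True)`, this leaves the original column untouched, so the same function can be called repeatedly on the DataFrame without side effects.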
pythoneer
  • [Here](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758) you will find an amazing explanation of why you should **not iterate**. However, I don't think it is harmful on a small DataFrame. – Paulo Marques Feb 02 '21 at 22:12
  • Thanks for the link. I don't really "iterate" over every item, as you can see from my code snippet (I use nunique, isna, etc. and call sum). However, since I use the same column (or Series) several times for different calculations or analyses, I was wondering whether there is a way to streamline the whole process in one go. – pythoneer Feb 03 '21 at 07:50

0 Answers