Normalizing data with many columns based on distribution

Question

I've been digging around about how to properly prepare data for clustering, and I came across this tutorial that explains you can't just randomly normalize each column, because normalizing a power law distribution will not yield a correct transformation (and you should use a log transform in that case).

I'm trying to transform a dataframe with 200+ columns (after preparing and removing mostly empty and autocorrelated columns). So my question is, is there a way to automatically check the distribution of each feature and then make the most fitting transformation (normalization for Gaussian distro, log transform for power law distro, using quantiles for "unrecognizable" distros etc.) automatically? Or is this something I have to do by hand for all those columns? Thank you!

There is a python package for that. You can look at [`fitter`](https://pypi.org/project/fitter/) to identify the most fitting distribution. — Michael Szczesny, Aug 28 '21 at 12:15

score 0 · Answer 1 · answered Aug 28 '21 at 15:55

If you are sure that your columns follow only two kinds of distribution (normal and exponential), then you may be able to use skewness statistics to tell the normal columns apart from the non-normal ones.
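As a rough sketch of the skewness idea, assuming numeric columns in a pandas DataFrame: the helper name `flag_skewed_columns` and the cutoff of 1.0 are illustrative choices, not anything from a library or from the answer itself. A normal distribution has skewness 0 and an exponential has skewness 2, so a cutoff between the two separates them when those are the only candidates.

```python
import numpy as np
import pandas as pd


def flag_skewed_columns(df: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Return a boolean Series that is True where |sample skewness| > threshold.

    `threshold` is a hypothetical cutoff; tune it for your own data.
    """
    skewness = df.select_dtypes(include="number").skew()
    return skewness.abs() > threshold


# Synthetic demo data (made up for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "gaussian": rng.normal(loc=5.0, scale=2.0, size=5000),
    "exponential": rng.exponential(scale=2.0, size=5000),
})

flags = flag_skewed_columns(demo)
print(flags)
```

Columns flagged `True` would be candidates for a log or quantile transform, and the rest for plain standardization.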

Otherwise, check this article out:

https://towardsdatascience.com/identify-your-datas-distribution-d76062fc0802


Babak Fi Foo

I am not sure; I can have any number of distributions in 200+ columns, which is why I want to automate this :D – lte__ Aug 28 '21 at 16:39

You need to iterate over the columns and use the test statistic results with logical operators to decide how to transform each one. For instance, if a normal distribution fits better, the column is standardized automatically. – Babak Fi Foo Aug 28 '21 at 16:42
Thanks. Unfortunately I don't have a paid account for TDS, so I couldn't read the article (an incognito window also doesn't work...) – lte__ Aug 28 '21 at 16:52
Maybe this might help? https://stackoverflow.com/questions/37487830/how-to-find-probability-distribution-and-parameters-for-real-data-python-3 – Babak Fi Foo Aug 28 '21 at 17:06
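The "iterate over columns and branch on a statistic" suggestion from the comments could look roughly like the sketch below. It makes several assumptions: sample skewness stands in for a formal goodness-of-fit test (a proper test or a distribution fit would be stricter), the 0.5 cutoff is arbitrary, and the names `pick_transform` and `transform_columns` are hypothetical. The quantile fallback is a simple rank-based mapping to (0, 1).

```python
import numpy as np
import pandas as pd


def standardize(x: pd.Series) -> pd.Series:
    """Zero-mean, unit-variance scaling."""
    return (x - x.mean()) / x.std(ddof=0)


def pick_transform(x: pd.Series, max_abs_skew: float = 0.5) -> str:
    """Choose a transform name from skewness before and after a log."""
    x = x.dropna()
    if abs(x.skew()) <= max_abs_skew:
        return "standardize"
    # The log is only defined for strictly positive values.
    if (x > 0).all() and abs(np.log(x).skew()) <= max_abs_skew:
        return "log"
    return "quantile"


def transform_columns(df: pd.DataFrame, max_abs_skew: float = 0.5):
    """Apply the chosen transform per numeric column; return data and choices."""
    columns = {}
    choices = {}
    for col in df.select_dtypes(include="number").columns:
        x = df[col]
        choice = pick_transform(x, max_abs_skew)
        if choice == "standardize":
            columns[col] = standardize(x)
        elif choice == "log":
            columns[col] = standardize(np.log(x))
        else:
            # Rank-based quantile transform: values land in (0, 1).
            columns[col] = (x.rank(method="average") - 0.5) / x.count()
        choices[col] = choice
    return pd.DataFrame(columns, index=df.index), choices


# Synthetic demo data (made up for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "gaussian": rng.normal(loc=5.0, scale=2.0, size=5000),
    "lognormal": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
    "exponential": rng.exponential(scale=2.0, size=5000),
})

transformed, choices = transform_columns(demo)
print(choices)
```

Returning the per-column choices alongside the data makes it easy to spot-check a handful of the 200+ columns by hand before trusting the automatic decision.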