I have a huge dataset with 1000+ columns. Most of them contain NaNs or only a few values. Manually sifting through each column is an unreasonable waste of time. How can I estimate column diversity, top frequent values, etc. with a single command?
- `pandas.DataFrame.describe()` is featured very early on in the introductory text of pandas' documentation: http://pandas.pydata.org/pandas-docs/stable/10min.html as is counting unique values: http://pandas.pydata.org/pandas-docs/stable/10min.html#histogramming – Paul H Mar 09 '17 at 17:38
- What do you mean by "few" values? Do you expect discrete repeated values or floats? – FLab Mar 16 '17 at 16:53
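
Following up on Paul H's comment, a minimal sketch of the one-command summary (the toy frame below is made up to stand in for the real 1000+-column dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "x", "y", np.nan],
})

# One command: count, unique, top, freq, mean, min, max, ... per column
print(df.describe(include="all"))

# Distinct non-NaN values per column -- a quick diversity estimate
print(df.nunique())

# Top frequent values of one column
print(df["b"].value_counts().head())
```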
1 Answer
First, you need to extract what a single column contains; you can do that with a list comprehension like this:
column = [array[i] for i in range(0, len(array), STEP)]
where `STEP` is the number of columns in your file.
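
As a concrete illustration of that comprehension (the flat `array` below is hypothetical, standing in for a file read row by row into one list):

```python
# Hypothetical flat array: 3 rows x 2 columns, stored row-major
array = [10, "x",
         20, "y",
         30, "x"]
STEP = 2  # number of columns in the file

# First column: indices 0, 2, 4, ...
column = [array[i] for i in range(0, len(array), STEP)]
print(column)   # [10, 20, 30]

# Shift the start index to pull out the other columns
second = [array[i] for i in range(1, len(array), STEP)]
print(second)   # ['x', 'y', 'x']
```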
Then you can do whatever you want with that. To answer your question, you can use e.g. max(column) - min(column), which gives you the range of the column, one rough measure of diversity.
To get the top common values, I suggest you look here:
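One common way to get the most common values (as a sketch; this may not be what the answer linked to) is `collections.Counter` from the standard library:

```python
from collections import Counter

column = ["x", "y", "x", "x", "y", "z"]

# most_common returns (value, count) pairs, most frequent first
print(Counter(column).most_common(3))  # [('x', 3), ('y', 2), ('z', 1)]
```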

Paweł Balawender