
I have a huge dataset with 1000+ columns. Most of them contain NaNs or just a few values. Manually sifting through each column is an unreasonable waste of time. How can I estimate column diversity, top frequent values, etc. with a single command?

Denis Kulagin
  • `pandas.DataFrame.describe()` is featured very early in the introductory text of pandas' documentation: http://pandas.pydata.org/pandas-docs/stable/10min.html as is counting unique values: http://pandas.pydata.org/pandas-docs/stable/10min.html#histogramming (see the sketch after these comments) – Paul H Mar 09 '17 at 17:38
  • What do you mean by "few" values? Do you expect discrete repeated values or floats? – FLab Mar 16 '17 at 16:53
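For illustration, a minimal sketch of the one-command approach the comments point to, assuming a pandas DataFrame named `df` (the sample data here is made up):

    import numpy as np
    import pandas as pd

    # Made-up DataFrame with NaNs and sparse values
    df = pd.DataFrame({
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": ["x", "x", np.nan, "y"],
    })

    # Count, unique, top, freq, mean, std, quartiles per column in one call
    print(df.describe(include="all"))

    # Non-NaN count and number of distinct values per column
    print(df.count())
    print(df.nunique())

    # Top frequent values for one column
    print(df["b"].value_counts())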

1 Answer


First, you need to extract what a single column contains; you can do that with a list comprehension like this:

column = [array[i] for i in range(0, len(array), STEP)]

where `STEP` is the number of columns in your file.
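For example, assuming the data was read into a flat list row by row with 3 columns per row (hypothetical data):

    # Hypothetical flat list: 3 columns per row, flattened row by row
    array = [1, 2, 3,
             4, 5, 6,
             7, 8, 9]
    STEP = 3  # number of columns in the file

    # Every STEP-th element, starting at index 0, is the first column
    column = [array[i] for i in range(0, len(array), STEP)]
    print(column)  # [1, 4, 7]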

Then you can do whatever you want with that. Answering your questions: you can use, for example, `max(column) - min(column)`, which will give you a measure of diversity. To get the top common values, I suggest you look here:

click
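Since the link above has no visible target, here is a minimal sketch of both ideas using Python's built-in `collections.Counter` (an assumption; the linked answer may suggest a different approach):

    from collections import Counter

    column = [3, 1, 3, 2, 3, 1]  # e.g. a column extracted as above

    # Range of the values, as a rough measure of diversity
    print(max(column) - min(column))  # 2

    # Top-N most common values with their counts
    print(Counter(column).most_common(2))  # [(3, 3), (1, 2)]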
