I used pandas to generate a DataFrame with several rows and columns.

I am now trying to determine the average number of decimal places in each column. For example:

 A     B       C
 10.1  22.541  21.44
 10.2  23.548  19.4
 11.2  26.547  15.45

The program would return 1 for A, 3 for B and about 1.67 for C (the mean of 2, 1 and 2).

Is there an efficient way to do this, given that the DataFrame I'm handling has about 16,000 rows?

Thank you

AlexJJ
  • Can you give some examples to make it clear what you mean by "number of decimals"? It's ambiguous as it stands. – Mark Dickinson Nov 11 '19 at 20:14
  • Please provide some sample input along with the desired output. – Cleb Nov 11 '19 at 20:15
  • For example, the program should return 2 for 2.98 and 1 for 2.1, and take the mean of these values for the column :) – AlexJJ Nov 11 '19 at 20:16
  • Computing the number of decimals after the point is tricky: it's not a particularly well-defined notion, thanks to the use of binary floating-point. See [this excellent answer](https://stackoverflow.com/a/17838332/270986) from Keith Thompson on the subject. (It's about C, but the principle is the same: Python uses the same floating-point format.) – Mark Dickinson Nov 11 '19 at 20:22
  • I'll have a look at that, thanks ;) – AlexJJ Nov 11 '19 at 21:29
  • Why are you trying to do this? It seems like a really odd and impractical idea. – AMC Nov 11 '19 at 21:36
  • I've been asked to include this in a program, so I'm doing it, odd idea or not ;P – AlexJJ Nov 12 '19 at 20:10
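
To make the floating-point caveat from the comments concrete, here is a small sketch (not part of the original thread). It compares the exact binary value stored for a literal with the shortest string Python prints for it. The variable names are illustrative only.

```python
from decimal import Decimal

# Exact value of the binary double nearest to 10.1 (not exactly 10.1).
exact = Decimal(10.1)

# Shortest decimal string that round-trips to the same double.
shortest = repr(10.1)

# Counting characters after the point only works on the printed form.
decimals_in_repr = len(shortest.split('.')[1])
decimals_in_exact = len(str(exact).split('.')[1])

# Arithmetic results can print with many more digits than the inputs had.
decimals_in_sum = len(repr(0.1 + 0.2).split('.')[1])

print(shortest, decimals_in_repr)
print(exact, decimals_in_exact)
print(repr(0.1 + 0.2), decimals_in_sum)
```

Any string-based count is therefore a count of the digits in the printed representation, which is what the answer below relies on.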

1 Answer

Updated code

OK, here it is. It may be a little bit complicated ;)

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [10.1, 10.2, 11.2], 'B': [22.541, 23.548, 26.547], 'C': [21.44, 19.4, 15.45]})
df

Out[1]:
      A       B      C
0  10.1  22.541  21.44
1  10.2  23.548  19.4
2  11.2  26.547  15.45


# For each column: convert to strings, keep the part after '.', and average its length
[df[col].astype(str).str.split('.', expand=True)[1].str.len().mean() for col in df.columns]

Out[2]:
[1.0, 3.0, 1.6666666666666667]

Step by step:

# One row per original row, one column per original column: number of characters after the '.'
df1 = pd.DataFrame([df[col].astype(str).str.split('.', expand=True)[1].str.len().values for col in df.columns]).T
df1

Out[3]:
    0   1   2
0   1   3   2
1   1   3   1
2   1   3   2

df1.mean()

Out[4]:
0    1.000000
1    3.000000
2    1.666667
dtype: float64
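
The result above is indexed 0, 1, 2 rather than by column name. As a possible variant (a sketch using the same string-splitting idea, not the code from the answer itself), the whole frame can be converted to strings at once so that the original column labels are kept:

```python
import pandas as pd

df = pd.DataFrame({'A': [10.1, 10.2, 11.2],
                   'B': [22.541, 23.548, 26.547],
                   'C': [21.44, 19.4, 15.45]})

# Convert every value to its string form, then for each column take the
# text after the '.' and measure its length.
decimals = df.astype(str).apply(lambda s: s.str.split('.').str[1].str.len())

# Column-wise mean, indexed by the original column names A, B, C.
result = decimals.mean()
print(result)
```

This assumes every column holds floats whose string form contains a '.'.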
Alex
  • Sorry, I wasn't very clear. I'm asking to count the digits after the decimal point and average them :) – AlexJJ Nov 11 '19 at 20:24
  • Could you show, with an expression, how you get 1 for 'A' from the numbers 10.1, 10.2, 11.2? – Alex Nov 11 '19 at 20:28
  • For 10.1 it returns 1, for 10.11 it returns 2, for 10.112 it returns 3. It's the number of decimals, i.e. the number of characters after the decimal point :) – AlexJJ Nov 11 '19 at 20:31
  • I've tried it on my example dataframe and it works, but on my real dataframe it raises an error: "KeyError: 1" – AlexJJ Nov 11 '19 at 20:59
  • I've just simplified it a bit, see my update. And what kind of error does it give you? – Alex Nov 11 '19 at 21:02
  • Could you show the output of "df.columns" and "df.to_dict()"? – Alex Nov 11 '19 at 21:04
  • From testing, I think it's because one of my columns is a boolean column with True and False. I'm trying to replace df with a slice of the wanted columns (without this column) :) – AlexJJ Nov 11 '19 at 21:07
  • Right. If even one column is not of float type (has no '.' character), the split does not produce two elements, so [1] cannot be looked up, which gives exactly that KeyError. Just for testing, you can do this: df = df.drop('name_of_boolean_column', axis=1) – Alex Nov 11 '19 at 21:10
  • It works if I copy list(df.columns) and remove the unwanted columns. Thanks very much! Also, do you know how to return 0 if the only decimal digit is a 0 (that is, when value == int(value))? – AlexJJ Nov 11 '19 at 21:18
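
The two follow-up points in these comments (skipping non-float columns such as booleans, and returning 0 for whole numbers like 11.0) could be handled along these lines. This is only a sketch under stated assumptions: the helper names `count_decimals` and `mean_decimals` and the `flag` column are hypothetical, and only `float64` columns are considered.

```python
import pandas as pd

def count_decimals(value):
    """Digits after the point in the printed form of a float; 0 for whole numbers."""
    text = repr(float(value))
    # nan, inf and scientific notation (e.g. 1e-05) have no plain decimal part.
    if '.' not in text or 'e' in text:
        return float('nan')
    frac = text.split('.')[1]
    return 0 if frac == '0' else len(frac)

def mean_decimals(df):
    """Mean decimal count per float64 column; other dtypes are skipped."""
    floats = df.select_dtypes(include='float64')
    return floats.apply(lambda col: col.map(count_decimals).mean())

# Hypothetical data: 'flag' stands in for the boolean column from the comments.
df = pd.DataFrame({'A': [10.1, 10.2, 11.0],
                   'flag': [True, False, True],
                   'C': [21.44, 19.4, 15.45]})

result = mean_decimals(df)
print(result)
```

Selecting columns by dtype avoids the KeyError without listing the unwanted columns by hand. Values that print as nan come back as NaN, which `mean()` skips by default.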