7

I have this code below. It is surprizing for me that it works for the columns and not for the rows.

import pandas as pd

def summarizing_data_variables(df):
    numberRows=size(df['ID'])
    numberColumns=size(df.columns)
    summaryVariables=np.empty([numberColumns,2], dtype =  np.dtype('a50'))    
    cont=-1    
    for column in df.columns:
        cont=cont+1
        summaryVariables[cont][0]=column
        summaryVariables[cont][1]=size(df[df[column].isin([0])][column])/(1.0*numberRows)
    print summaryVariables

def summarizing_data_users(fileName):
    print "Sumarizing users..."   
    numberRows=size(df['ID'])
    numberColumns=size(df.columns)      
    summaryVariables=np.empty([numberRows,2], dtype =  np.dtype('a50'))    
    cont=-1

    for row in df['ID']:
        cont=cont+1
        summaryVariables[cont][0]=row
        dft=df[df['ID']==row]
        proportionZeros=(size(dft[dft.isin([0])])-1)/(1.0*(numberColumns-1)) # THe -1 is used to not count the ID column
        summaryVariables[cont][1]=proportionZeros
    print summaryVariables


if __name__ == '__main__':

    df = pd.DataFrame([[1, 2, 3], [2, 5, 0.0],[3,4,5]])
    df.columns=['ID','var1','var2']
    print df

    summarizing_data_variables(df)
    summarizing_data_users(df) 

The output is this:

   ID  var1  var2
0   1     2     3
1   2     5     0
2   3     4     5
[['ID' '0.0']
 ['var1' '0.0']
 ['var2' '0.333333333333']]
Sumarizing users...
[['1' '1.0']
 ['2' '1.0']
 ['3' '1.0']]

I was expecting that for users:

Sumarizing users...
[['1' '0.0']
 ['2' '0.5']
 ['3' '0.0']]

It seems that the problem is in this line:

dft[dft.isin([0])]

It does not constrain dft to the "True" values like in the first case.

Can you help me with this? (1) How to correct the users (ROWS) part (second function above)? (2) Is this the most efficient method to do this? [My database is very big]

EDIT:

In function summarizing_data_variables(df) I try to evaluate the proportion of zeros in each column. In the example above, the variable Id has no zero (thus the proportion is zero), the variable var1 has no zero (thus the proportion is also zero) and the variable var2 presents a zero in the second row (thus the proportion is 1/3). I keep these values in a 2D numpy.array where the first column is the label of the column of the dataframe and the second column is the evaluated proportion.

The function summarizing_data_users I want to do the same, but I do that for each row. However, it is NOT working.

DanielTheRocketMan
  • 3,199
  • 5
  • 36
  • 65
  • 1
    using loops for looping through column/rows of your DFs is definitely not the best approach when working with Pandas. Just explain what are you doing in `summarizing_data_variables` and in `summarizing_data_users` functions (because it's not quite clear) and we will try to find efficient Pandas way to do that... – MaxU - stand with Ukraine Mar 06 '16 at 16:49
  • @MaxU thank you. See my edit above, please. – DanielTheRocketMan Mar 06 '16 at 17:01
  • 1
    to briefly expand on @MaxU comments, here you really only need to show (1) sample data, (2) what you tried, (3) desired results. Not a bad question, but it could be a bit more concise. ;-) – JohnE Mar 06 '16 at 17:04
  • @JohnE I tried to explore the fact that both functions are very similar to each other and one of them works for one of the cases and the other one does not work. Sorry (:-( – DanielTheRocketMan Mar 06 '16 at 17:08

2 Answers2

8

try this instead of the first funtion:

print(df[df == 0].count(axis=1)/len(df.columns))

UPDATE (correction):

print('rows')
print(df[df == 0].count(axis=1)/len(df.columns))
print('cols')
print(df[df == 0].count(axis=0)/len(df.index))

Input data (i've decided to add a few rows):

ID  var1  var2
1     2     3
2     5     0
3     4     5
4    10    10
5    1      0

Output:

rows
ID
1    0.0
2    0.5
3    0.0
4    0.0
5    0.5
dtype: float64
cols
var1    0.0
var2    0.4
dtype: float64
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
6

My favorite way of getting number of nonzeros in each column is

df.astype(bool).sum(axis=0)

For the number of non-zeros in each row use

df.astype(bool).sum(axis=1)

Notice:

If you have nans in your df you should make these zero first, otherwise they will be counted as 1.

df.fillna(0).astype(bool).sum(axis=1)
Kevin Chou
  • 489
  • 5
  • 8
  • related question :[Counting non zero values in each column of a dataframe in python](https://stackoverflow.com/questions/26053849/counting-non-zero-values-in-each-column-of-a-dataframe-in-python) – Kevin Chou Apr 10 '19 at 06:59