I have the code below. It surprises me that it works for the columns but not for the rows.
import numpy as np
import pandas as pd
from numpy import size

def summarizing_data_variables(df):
    numberRows = size(df['ID'])
    numberColumns = size(df.columns)
    summaryVariables = np.empty([numberColumns, 2], dtype=np.dtype('a50'))
    cont = -1
    for column in df.columns:
        cont = cont + 1
        summaryVariables[cont][0] = column
        # Proportion of zeros in this column
        summaryVariables[cont][1] = size(df[df[column].isin([0])][column]) / (1.0 * numberRows)
    print summaryVariables

def summarizing_data_users(df):
    print "Sumarizing users..."
    numberRows = size(df['ID'])
    numberColumns = size(df.columns)
    summaryVariables = np.empty([numberRows, 2], dtype=np.dtype('a50'))
    cont = -1
    for row in df['ID']:
        cont = cont + 1
        summaryVariables[cont][0] = row
        dft = df[df['ID'] == row]
        # The -1 is used to not count the ID column
        proportionZeros = (size(dft[dft.isin([0])]) - 1) / (1.0 * (numberColumns - 1))
        summaryVariables[cont][1] = proportionZeros
    print summaryVariables

if __name__ == '__main__':
    df = pd.DataFrame([[1, 2, 3], [2, 5, 0.0], [3, 4, 5]])
    df.columns = ['ID', 'var1', 'var2']
    print df
    summarizing_data_variables(df)
    summarizing_data_users(df)
The output is this:
   ID  var1  var2
0   1     2     3
1   2     5     0
2   3     4     5
[['ID' '0.0']
 ['var1' '0.0']
 ['var2' '0.333333333333']]
Sumarizing users...
[['1' '1.0']
 ['2' '1.0']
 ['3' '1.0']]
For the users, I was expecting this instead:
Sumarizing users...
[['1' '0.0']
 ['2' '0.5']
 ['3' '0.0']]
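The problem seems to be in this expression:
dft[dft.isin([0])]
Unlike the Series indexing in the first function, it does not restrict dft to the entries where the mask is True, so the size I get is always the full number of columns.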
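To make the difference concrete, here is a minimal sketch (written for Python 3 and a recent pandas, which is an assumption; the code above is Python 2). It suggests that indexing a Series with a boolean mask drops the False entries, while indexing a DataFrame with a boolean DataFrame keeps the original shape and fills the False positions with NaN. The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 5, 0.0], [3, 4, 5]],
                  columns=['ID', 'var1', 'var2'])

# Series case (what the column function does): a boolean mask on a
# Series keeps only the entries where the mask is True.
zeros_in_var2 = df['var2'][df['var2'].isin([0])]
series_count = zeros_in_var2.size

# DataFrame case (what the row function does): a boolean DataFrame used
# as an indexer keeps the shape and puts NaN where the mask is False.
dft = df[df['ID'] == 2]
masked = dft[dft.isin([0])]

# Counting the True values in the mask gives the number of zeros instead.
zero_count = int(dft.isin([0]).to_numpy().sum())

print(series_count)   # number of zeros found in var2
print(masked.shape)   # shape is unchanged by the DataFrame mask
print(zero_count)     # zeros in the row with ID == 2
```

So the size of the masked one-row DataFrame is always the number of columns, which would explain why every user ends up with 1.0.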
Can you help me with this? (1) How can I correct the users (rows) part, i.e. the second function above? (2) Is this the most efficient way to do it? My dataset is very large.
EDIT:
In the function summarizing_data_variables(df) I try to compute the proportion of zeros in each column. In the example above, the variable ID has no zeros (so the proportion is zero), the variable var1 has no zeros (so the proportion is also zero), and the variable var2 has a zero in the second row (so the proportion is 1/3). I keep these values in a 2D numpy.array where the first column is the label of the DataFrame column and the second column is the computed proportion.
In the function summarizing_data_users I want to do the same thing, but for each row instead of each column. However, it is NOT working.
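For reference, one possible vectorized way to get both summaries without looping is sketched below. This is an assumption about what would suit a large table, not a benchmarked result, and it is written for Python 3 with a recent pandas. It relies on the fact that the mean of a boolean mask is the proportion of True values; zeros_per_user and zeros_per_variable are names made up for this example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 5, 0.0], [3, 4, 5]],
                  columns=['ID', 'var1', 'var2'])

# Move ID into the index so it is not counted as a variable.
by_user = df.set_index('ID')

# Proportion of zeros per row (user): mean of the boolean mask across columns.
zeros_per_user = by_user.eq(0).mean(axis=1)

# Proportion of zeros per column (variable), ID included as in the
# original column function.
zeros_per_variable = df.eq(0).mean(axis=0)

print(zeros_per_user)
print(zeros_per_variable)
```

On the example data this should give 0.0, 0.5 and 0.0 for users 1, 2 and 3, which is the result expected above.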