0

I want to create a column in a pandas dataframe that would add the values of the other columns (which are 0 or 1s). the column is called "sum"

my HEADPandas looks like:

     Application  AnsSr sum  Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11      0.0   0.0  28/12/11      ....    ...Dates...    28/12/11
~00c        0      0.0   0.0  0    0     0  ....    0       0       0
~00pr       0      0.0   0.0  0    0     0  ....    0       0       0
~00te       0      0.0   0.0  0    0     1  ....    0       0       1

in an image from pythoneverywhere: enter image description here

expected result (assuming there would be no more columns

     Application  AnsSr sum  Col1  Col2  Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11      0.0   nan  28/12/11        ....    ...Dates...    28/12/11
~00c        0      0.0   0.0   0    0     0  ....    0      0        0
~00pr       0      0.0   0.0   0    0     0  ....    0      0        0
~00te       0      0.0   2     0    0     1  ....    0      0        1

as you see the values of 'sum' are kept 0 even if there are 1s values in some columns. what Am I doing wrong?

The basics of the code are:

theMatrix=pd.DataFrame([datetime.today().strftime('%Y-%m-%d')],['Date'],['Application'])
theMatrix['Ans'] = 0
theMatrix['sum'] = 0

so far so good then I add all the values with loc. and then I want to add up values with

theMatrix.fillna(0, inplace=True)
# this being the key line:
theMatrix['sum'] = theMatrix.sum(axis=1)
theMatrix.sort_index(axis=0, ascending=True, inplace=True)

As you see in the result (attached image) the sum remains 0. I had a look to here or here and to the pandas documentation at no avail. Actually the expression:

theMatrix['sum'] = theMatrix.sum(axis=1)

I got it from there.

changing this last line by:

theMatrix['sum'] = theMatrix[3:0].sum(axis=1)

in order to avoid to sum the first three columns gives as result:

     Application  AnsSr sum  Col1  Col2  Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11      0.0   nan  28/12/11        ....    ...Dates...    28/12/11
~00c        0      0.0   nan   1    1     0  ....    0      0        0
~00pr       0      0.0   1.0   0    0     0  ....    0      0        1
~00te       0      0.0   0     0    0     0  ....    0      0        0

please observe two things: a) how in row '~00c' sum is nan but there are 1s in that row. b) before the calculating the sum the code theMatrix.fillna(0, inplace=True) should have change all possible nan into 0 so the sum should never be nan since in theory there are no nan values in any of the columns[3:]

it wouldnt work.

some idea?

thanks

PS: Later edition, just in case you wondere how the dataframe is populated: reading and parsing an XML and the lines are:

# myDocId being the name of the columns
# concept being the index.
theMatrix.loc[concept,myDocId]=1 
JFerro
  • 3,203
  • 7
  • 35
  • 88
  • 1
    Please post samples in text format, images samples are NOT recommended. – RavinderSingh13 Dec 11 '18 at 14:19
  • Thanks for the coment. Image is only a small image of the dataframe. all the code is in text – JFerro Dec 11 '18 at 14:27
  • Yes, please add TEXT sample of input and expected data in your post. – RavinderSingh13 Dec 11 '18 at 14:30
  • `theMatrix['sum'] = theMatrix.sum(axis=1)` should give the sum of all values in a row (i.e. including column `sum` as well), which is fine if you do it once. In order not to include the column `sum` use something like `theMatrix['sum'] = theMatrix[[col_1, col_2]].sum(axis=1). – Julia Dec 11 '18 at 14:38
  • Are you getting any errors? Your code works fine for me. I get correct `sum` count. – Julia Dec 11 '18 at 14:54
  • @Julia what it seems to be clear is that: theMatrix['sum'] = theMatrix[3:].sum(axis=1) #theMatrix['sum'] = theMatrix.sum(axis=1) – JFerro Dec 11 '18 at 15:25
  • yes, i edited the last part of the question to clarifry where are the errors. thx – JFerro Dec 11 '18 at 16:13
  • 1
    Your first slice [3:] is cutting the columns off, your still summing data in the date column providing you the nan value in the first column. In the second column, it may or may not be a value error, or issue in the data itself. I would try the updated code below to see if that would help. I try not to call functions on data of incorrect type. – johnnyb Dec 11 '18 at 16:44
  • 1
    Hi @johnnyb, its a hell of a lot of data, but you gave me an idea: pd.set_option('display.max_columns', None) pd.set_option('display.max_rows', None) and go over all the data after printing. – JFerro Dec 11 '18 at 16:46
  • Any updates after looking at the data more closely? – johnnyb Dec 11 '18 at 21:14
  • With the following two properties: pd.set_option('display.max_columns', None) pd.set_option('display.max_rows', None) I display all the contents of the dataframe, a bunch of 0s and 1s. there I dont see any value not being 1 or 0 in row 3 (which is the first that should give a added 'sum' value). nevertheless row 3, and row 6 'sum' is NaN, which can only be if there is a strange data in one of the values of row 3 and 5. And I dont see it. A pity I can not add a picture in a comment. – JFerro Dec 12 '18 at 11:07
  • Try and cast it as an int using df[3:].astype(int).sum(axis=1) – johnnyb Dec 13 '18 at 04:36

2 Answers2

1

If I understand correctly, this can help you:

import pandas as pd
import datetime

#create dataframe following your example
theMatrix=pd.DataFrame([datetime.datetime.today().strftime('%Y-%m-%d')],['Date'],['Application'])
theMatrix['Ans'] = 0
theMatrix['col1'] = 1
theMatrix['col2'] = 1

# create 'sum' column with summed values from certain columns
theMatrix['sum'] = theMatrix['col1'] + theMatrix['col2']
Neo
  • 627
  • 3
  • 7
  • Could it be a problem that the first row is actually not "addable" since it is populated with dates. I am using pythoneverywhere and in the console the 'sum' column would not add the values – JFerro Dec 11 '18 at 15:20
  • theMatrix['sum'] = theMatrix[3:].sum(axis=1) adds up some values, but not all, because there are some values in sum that now are NaN even if a) there are some 1s in the columns and b) if there are any it should remain 0 – JFerro Dec 11 '18 at 15:28
  • @JoseBerlines Updated to match your dataset a bit clearer. – johnnyb Dec 11 '18 at 15:31
1

Any data you choose to sum, just add to a list, and use that list to provide to your sum function, with axis=1. This will provide you the desired outcome. Here is a sample related to your data.

Sample File Data:

Date,a,b,c
bad, bad, bad, bad # Used to simulate your data better
2018-11-19,1,0,0
2018-11-20,1,0,0
2018-11-21,1,0,1
2018-11-23,1,nan,0 # Nan here is just to represent the missing data
2018-11-28,1,0,1
2018-11-30,1,nan,1 # Nan here is just to represent the missing data
2018-12-02,1,0,1

Code:

import pandas as pd
df = pd.read_csv(yourdata.filename) # Your method of loading the data
#rows_to_sum = ['a','b','c'] # The rows you wish to summarize
rows_to_sum = df.columns[1:] # Alternate method to select remainder of rows.
df = df.fillna(0) # used to fill the NaN you were talking about below. 
df['sum'] = df[rows_to_sum][1:].astype(int).sum(axis=1) # skip the correct amount of rows here. 
# Also, the use of astype(int), is due to the bad data read from the top. So redefining it here, allows you to sum it appropriately. 
print(df)

Output:

       Date     a     b     c  sum
        bad   bad   bad   bad  NaN
 2018-11-19     1     0     0  1.0
 2018-11-20     1     0     0  1.0
 2018-11-21     1     0     1  2.0
 2018-11-23     1     0     0  1.0
 2018-11-28     1     0     1  2.0
 2018-11-30     1     0     1  2.0
 2018-12-02     1     0     1  2.0
johnnyb
  • 1,745
  • 3
  • 17
  • 47