1

I want to subtract the cell values from one column with cell values from another column and write the sum to a new column in an excel file. Then I want the sum, if not equal to 0, to be added to a list for later use. The data in my excel file are structured like this:

Name | Number | Name1 | Number1
Name2 | Number2 | Name3 | Number3
....
Namex | Numberx | Namey |Numbery

I want to subtract the numbers from each other and then add the sum to a new column like this:

Name| Number | Name1 | Number1 | Sum of (Number - Number1)

I have tried to use openpyxl to do this, but I am really confused because the docs are so different from earlier versions of Python to newer. I am working in Python 3.4. I am happy to get suggestions on which module you would recommend me using. The code I have so far is giving me errors because I am calling the excelfile as a generator, not a subscriptable. I am not sure how to search and read an excelfile and at the same time make it subscriptable so that it is possible to write to it. Could anyone please help me?

Here is my code:

from openpyxl import Workbook, load_workbook

def analyzexlsx(filepath):
    numbers = []
    excel_input = load_workbook(filepath)
    filepath = [pth for pth in Path.cwd().iterdir()
                  if pth.suffix == '.xlsx'] #Want to iterate through several excel files in a folder.
    ws = excel_input.active
    cols = tuple(ws.columns)
    col_b = cols[1] 
    col_e = cols[4] 
    for j, k in zip(col_e, col_b): 
        if None:
            print('None')
        equally = (int(j.value) - int(k.value)) #line 13, error. Trying to subtract column cell values.
        if equally != 0: #If the columns sum is not equal to 0, it is to be added to the numbers list.
            numbers.append(j.row)

        else:
            pass

    col1 = []
    col2 = []
    col4 = []
    col5 = []
    col7 = []
    col8 = []

    mainlist = []
    try:
        for row in numbers:
            col1.append(str(ws.cell(row=row, column=1).value))
            col2.append(str(ws.cell(row=row, column=2).value))
            col4.append(ws.cell(row=row, column=4).value)
            col5.append(ws.cell(row=row, column=5).value)
            col7.append(ws.cell(row=row, column=7).value)
            col8.append(ws.cell(row=row, column=8).value)
    finally:
        for i, j, k, l, m, n in zip(col1, col2, col4, col5, col7, col8):
            mainlist.append(i + ", " + j + ", " + k + ", " + l + ", " + m + ", " + n)
    return mainlist

Traceback (most recent call last):
    Line 13, in analyzexlsx
        equally = (int(j.value) - int(k.value))
    TypeError: int() argument must be a string or a number, not 'NoneType

I would be really happy for answers as I have worked on this for quite a while and now I am stuck. I am fairly new to Python.

Pexe
  • 91
  • 3
  • 14

1 Answers1

3

First create DataFrame from excel by read_excel.

Then need substract 2. with 4 columns:

df = pd.read_excel('file.xlsx')

#select by column name
df['E'] = df['B'] - df['D']

#select by positions, but python count from 0 so for 2. column need 1
df['E'] = df.iloc[:, 1] - df.iloc[:, 3]

Maybe also help check documentation.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you, I will try this. Will let you know if it is working for me! – Pexe Jul 06 '17 at 10:50
  • Sure, no problem. – jezrael Jul 06 '17 at 10:51
  • I got the time to look at it a little bit, and I find it difficult to use this method when I have several excel files to go through. I need to go through all of the excelfiles in a directory and can not write "file.xlsx", as I do not know the name of all of the excel files from time to time. Have you got an idea on how to do this? – Pexe Jul 07 '17 at 12:32
  • I think you need [glob](https://docs.python.org/3/library/glob.html) like `files = glob.blob(*.xlsx)` for return all files names and then processes in loop by `for file in files:...` – jezrael Jul 07 '17 at 12:39
  • When trying to run this, I am getting a KeyError: 'E'. Do you know how to fix this? All of my excel files have got cell values in the columns. I am writing my code like this for testing: import pandas as pd import glob files = glob.glob(r'path\*.xlsx') for file in files: df = pd.read_excel(files) df['G'] = df['E'] - df['B'] if df['G'] != 0: print(df['G']) – Pexe Jul 11 '17 at 06:35
  • E is column name. Maybe need change it. Check column names by `df = pd.read_excel(files) print (df.columns.tolist())` – jezrael Jul 11 '17 at 09:37
  • I have tried writing df[int[:,6]] = df[int[:,2]] - df[int[:,4]] as well, but this is giving me TypeError: 'type' object is not subscriptable, and I am back to my main issue. Could it be that it is my excel file there is something wrong with? – Pexe Jul 11 '17 at 12:44
  • But you need `iloc`, not `int`. Also there is 7 and more columns? – jezrael Jul 11 '17 at 12:46
  • need `df.iloc[:,6] = df.iloc[:, 2] - df.iloc[:, 4]`, but there has to be 7 and more columns. – jezrael Jul 11 '17 at 12:47
  • Column A, B, D, E, G and H have cell values, is that a problem? When using the method you are suggesting (df,iloc) I get the error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). – Pexe Jul 11 '17 at 13:08
  • Ok, if use https://dl.dropboxusercontent.com/u/84444599/file_sample.xlsx sample, how does it work? – jezrael Jul 11 '17 at 13:15
  • And code is `df = pd.read_excel('https://dl.dropboxusercontent.com/u/84444599/file_sample.xlsx') print (df) df.iloc[:,6] = df.iloc[:, 2] - df.iloc[:, 4] print (df)` – jezrael Jul 11 '17 at 13:17
  • What is same as `df = pd.read_excel('https://dl.dropboxusercontent.com/u/84444599/file_sample.xlsx') print (df) df['G'] = df['C'] - df['E'] print (df)` – jezrael Jul 11 '17 at 13:18
  • This actually worked. I do not understand why it will not do the same with my excel files. – Pexe Jul 11 '17 at 13:25
  • Hmmm, hard to answer without your files. Some ideas - Are all values for substract numeric? First row in files with header is not missing (it works, but first data row is converted to columns names). It failed in all files? Or only in some of them? – jezrael Jul 11 '17 at 13:50
  • It fails on all of the files. I am able to print the content, but when I try to subtract or even write if df['G'] != 0: print(df['G'] I get the ValueError. – Pexe Jul 12 '17 at 08:18
  • Sorry, I dont notice comment. You need `any()` methof for check if at least one value is True, because working with arrays like `if (df['G'] != 0).any(): print(df['G']`. Maybe better explain it [this](https://stackoverflow.com/a/42071828/2901002) answer. – jezrael Jul 18 '17 at 06:24
  • Do you mean that it is not possible for me to get the cell values not equal to 0 and put them in a list? It is only possible to check if there are any? Because getting those cell values that are not equal to 0 are the entire reason that I am making this code. As I wrote in the question as well. – Pexe Jul 26 '17 at 11:23
  • @Pexe - you are working with arrays, so need a but change scalar solution. But now I am not sure if undersnatd your comment. Do you need all not `0` values to list? If yes, need `L = df['G'][df['G'] != 0].tolist()` It si called [`boolean indexing`](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing). – jezrael Jul 26 '17 at 11:27
  • Or better `L = df.loc[df['G'] != 0, 'G'].tolist()` – jezrael Jul 26 '17 at 11:38
  • When I try to write L = df.loc[df.iloc[:,6] !=0].tolist() I get an AttributeError: 'DataFrame' object has no attribute 'tolist'. I cannot use the df['G'], because I do not have names for the columns. – Pexe Jul 27 '17 at 07:54
  • Then use trick `L = df.loc[df.iloc[:,6] !=0, df.columns[6]].tolist()` – jezrael Jul 27 '17 at 07:56
  • Thank you! Is there any way to write the subtraction to the list? Something like this: L = df.loc[df.iloc[:,2] - df.iloc[:,4] !=0].tolist()? – Pexe Jul 28 '17 at 06:46
  • Yes, exactly. `df.iloc[:,2]` return `Series` (column of `df`), so you can use all function working with Series. – jezrael Jul 28 '17 at 06:53
  • But when I write it like this (L = df.loc[df.iloc[:,2] - df.iloc[:,4] !=0].tolist()) I get the error message again: AttributeError: 'DataFrame' object has no attribute 'tolist'. I want the sum of column 2 - column 4 to be added to the list. – Pexe Jul 28 '17 at 08:12
  • ooops, I cannot see it. The best is 2 steps solution - first filter and then substract like `df1 = df[df.iloc[:,4] !=0]` and `L = (df1.iloc[:,2] - df1.iloc[:,4]).tolist()` – jezrael Jul 28 '17 at 08:15
  • I want the sum of the subtraction that is not equal to zero to be added to the list, when I use this method I do not get that. Is there any way to do this? – Pexe Jul 28 '17 at 09:40
  • It is a bit complicated without data. So if have this data `np.random.seed(10)` and then `df1 = pd.DataFrame(np.random.randint(5, size=(5, 8)))` what is desired output? – jezrael Jul 28 '17 at 10:18
  • Column 2 and column 4 contain cell values as numbers, like: 1, 2, 4, 3. I want the cell values to be subtracted so that if the number 2 is in column 2 and the number 1 is in column 4 the sum is 1. This value will be added to the list. But if column 2 has the number 2 and column 4 also has the number 2 then the sum is 0 and I want this to be added to the list. When the sum of column 2 - column 4 is not equal to 0 I want the entire row to be added to the list. Is this possible? – Pexe Aug 07 '17 at 11:14
  • Not sure if understand, but seems it is possible. The best is create new question with sample data (e.g. `df1 = pd.DataFrame({'a':[1,5,8],'b':[7,4,8]})`), your code and desired output. Because format of comments is horrible. Thank you. – jezrael Aug 07 '17 at 11:28