Reading values in column x from specific worksheets using pandas

Question

I am new to python and have looked at a number of similar problems on SO, but cannot find anything quite like the problem that I have and am therefore putting it forward:

I have an .xlsx dataset with data spread across eight worksheets and I want to do the following:

sum the values in the 14th column in each worksheet (the format, layout and type of data (scores) is the same in column 14 across all worksheets)
create a new worksheet with all summed values from column 14 in each worksheet
sort the totaled scores from highest to lowest
plot the summed values in a bar chart to compare

I cannot even begin this process because I am struggling at the first point. I am using pandas and am having trouble reading the data from one specific worksheet - I only seem to be able to read the data from the first worksheet only (I print the outcome to see what my system is reading in).

My first attempt produces an `Empty DataFrame':

import pandas as pd

y7data = pd.read_excel('Documents\\y7_20161128.xlsx', sheetname='7X', header=0,index_col=0,parse_cols="Achievement Points",convert_float=True)
print y7data

I also tried this but it only exported the entire first worksheet's data as opposed to the whole document (I am trying to do this so that I can understand how to export all data). I chose to do this thinking that maybe if I exported the data to a .csv, then it might give me a clearer view of what went wrong, but I am nonethewiser:

import pandas as pd
import numpy as np

y7data = pd.read_excel('Documents\\y7_20161128.xlsx')

y7data.to_csv("results.csv")

I have tried a number of different things to try and specify which column within each worksheet, but cannot get this to work; it only seems to produce the results for the first worksheet.

How can I, firstly, read the data from column 14 in every worksheet, and then carry out the rest of the steps?

Any guidance would be much appreciated.

UPDATE (for those using Enthought Canopy and struggling with openpyxl):

I am using Enthought Canopy IDE and was constantly receiving an error message around openpyxl not being installed no matter what I tried. For those of you having the same problem, save yourself lots of time and read this post. In short, register for an Enthought Canopy account (it's free), then run this code via the Canopy Command Prompt:

enpkg openpyxl 1.8.5

score 2 · Answer 1 · edited May 23 '17 at 10:30

I think you can use this sample file:

First read all columns in each sheet to list of columns called y7data:

y7data = [pd.read_excel('y7_20161128.xlsx', sheetname=i, parse_cols=[13]) for i in range(3)]
print (y7data)
[   a
0  1
1  5
2  9,    a
0  4
1  2
2  8,    a
0  5
1  8
2  5]

Then concat all columns together, I add keys which are used for axis x in graph, sum all columns, remove second level of MultiIndex (a, a, a in sample data) by reset_index and last sort_values:

print (pd.concat(y7data, axis=1, keys=['a','b','c']))
   a  b  c
   a  a  a
0  1  4  5
1  5  2  8
2  9  8  5

summed = pd.concat(y7data, axis=1, keys=['a','b','c'])
           .sum()
           .reset_index(drop=True, level=1)
           .sort_values(ascending=False)

print (summed)
c    18
a    15
b    14
dtype: int64

Create new DataFrame df, set column names and write to_excel:

df = summed.reset_index()#.
df.columns = ['a','summed']
print (df)
   a  summed
0  c      18
1  a      15
2  b      14

If need add new sheet use this solution:

from openpyxl import load_workbook

book = load_workbook('y7_20161128.xlsx')
writer = pd.ExcelWriter('y7_20161128.xlsx', engine='openpyxl') 
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, "Main", index=False)
writer.save()

Last Series.plot.bar:

import matplotlib.pyplot as plt

summed.plot.bar()
plt.show()

This looks good (I have run it in Enthought Canopy and achieved the same results). Would you please annotate/explain what is happening in the code? — Mus, Nov 28 '16 at 13:43
Thank you; in fact, I have noticed one accidental omission on my part (create a new worksheet with all summed values from column 14 in each worksheet) and have updated the question accordingly. — Mus, Nov 28 '16 at 13:44

Jivan · Answer 2 · 2016-11-28T13:49:32.500

0

From what I understand, your immediate problem is managing to load the 14th column from each of your worksheets.

You could be using ExcelFile.parse instead of read_excel and loop over your sheets.

xls_file = pd.ExcelFile('Documents\\y7_20161128.xlsx')
worksheets = ['Sheet1', 'Sheet2', 'Sheet3']
series = [xls_file.parse(sheet, parse_cols=[13]) for sheet in worksheets]
df = pd.DataFrame(series)

And from that, sum() your columns and keep going.

Using ExcelFile and then ExcelFile.parse() has the advantage to load your Excel file only once, and iterate over each worksheet. Using read_excel makes your Excel file to be loaded in each iteration, which is useless.

Documentation for pandas.ExcelFile.parse.

edited Nov 28 '16 at 13:49

answered Nov 28 '16 at 13:46

Jivan

21,522
15
80
131

Thank you, but this produces an error: `KeyError: 14` – Mus Nov 28 '16 at 13:49
Yes sorry this was a typo, I corrected it from `[14]` to `.ix[14]` – Jivan Nov 28 '16 at 13:50
Thank you, this is interesting. How do I then export the results into a new spreadsheet (I understand that `xls_file.to_csv` or `xls_file.to_excel` only works with `read_excel')? – Mus Nov 28 '16 at 13:59
Nope, `to_excel` works with any DataFrame. Once in a DataFrame, your data is completely decoupled from where it came from (Excel, csv file, JSON...). So you can export it in any form that you like. – Jivan Nov 28 '16 at 14:02
I must be doing something wrong then because I receive this error message when I try to export: `AttributeError: 'ExcelFile' object has no attribute 'to_excel'`. Here is my code: `xls_file.to_excel("Results.xlsx")` – Mus Nov 28 '16 at 14:09
Oh yes I understand - you have to call `to_excel` on your DataFrame object, not on your xls_file object. For instance `df.to_excel()` – Jivan Nov 28 '16 at 14:17
I have tried this and am now receiving this error: `ImportError: No module named openpyxl`. I'm guessing that I haven't imported something that should have been imported - correct? – Mus Nov 28 '16 at 14:54
Maybe you should `pip install openpyxl` then - https://openpyxl.readthedocs.io/en/default/ - but I guess we're getting out of the scope of the original question - perhaps you should post another one – Jivan Nov 28 '16 at 15:03
Yes, we may well be as I am now receiving an error: `ImportError: No module named openpyxl ` even after following the examples on readthedocs. – Mus Nov 28 '16 at 15:15

Reading values in column x from specific worksheets using pandas

2 Answers2