Using pandas to read certain columns from an xlsx, with a condition from another row

Question

I have an xlsx where the first 9 rows are headers. Row 1 contains a name, like "Bob" and "Alice".

Row 4 contains either 'Monthly' or 'Quarterly'.

Sometimes there are two fields called 'Bob' but one has 'Monthly' and the other has 'Quarterly' in row 4.

I understand I could read in the column called 'Bob' into a dataframe, but is there a way to specify which one should be loaded into the dataframe?

e.g. below I have bob and alice, and as it stands I would read in 2 Bob fields and 2 Alice fields. Is there a way of reducing these somehow on the initial readthrough?

import pandas as pd
fields = ['Bob', 'Alice']
type = ['Monthly','Quarterly']


df = pd.read_excel('data.xlsx', sheet='Sheet1', usecols=fields)
# See the keys
print df.keys()
# See content in 'Bob'
print df.bob

Alternatively, is there a way I can read all 4 columns - Bob and Alice - and then only keep the one I want (e.g. monthly for Bob, quarterly for Alice)?

Example xlsx file is as follows (formatted as a csv to make it look nicer here though):

Mnemonic:,Alice,Bob,Mnemonic:,Alice,Bob
Description:,Test results for Alice,Test results for Bob,Description:,Test results for Alice,Test results for Bob
Source:,(na),(na),Source:,(na),(na)
Native Frequency:,Monthly,Monthly,Native Frequency:,Quarterly,Quarterly
Transformation:,None,None,Transformation:,None,None
Begin Date:,10/31/2006,10/31/2006,Begin Date:,09/30/2006,09/30/2006
Last Updated:,,,Last Updated:,,
Historical End Date:,12/30/2017,12/30/2017,Historical End Date:,12/30/2017,12/30/2017
Geography:,(na),(na),Geography:,(na),(na)
10/31/2006,3,2,09/30/2006,3,2
11/30/2006,3,2,12/31/2006,5,1
12/31/2006,3,2,03/31/2007,7,4
01/31/2007,5,1,06/30/2007,8,7
02/28/2007,5,1,09/30/2007,1,2
03/31/2007,5,1,12/31/2007,6,9
04/30/2007,7,4,03/31/2008,1,5
05/31/2007,7,4,06/30/2008,9,7
06/30/2007,7,4,09/30/2008,9,2
07/31/2007,8,7,12/31/2008,8,7
08/31/2007,8,7,03/31/2009,5,8
09/30/2007,8,7,06/30/2009,3,6

score 0 · Answer 1 · answered Feb 05 '18 at 13:19

0

There isn't an option to filter the rows before the Excel file is loaded into a pandas object.

If your file was in csv format, you could iterate through chunks of the csv file and perform the filtering for each chunk. After this, you would then aggregate the chunks into one dataframe. See this answer for details.

answered Feb 05 '18 at 13:19

jpp

159,742
34
281
339

What about filtering after it's in a `pandas` object? Can I load ALL the rows (including the headers), then delete the columns I don't want? – Aaraeus Feb 05 '18 at 13:40
@Aaraeus, of course! would you mind amending your question accordingly? that way other users can find a problem-solution combo easily. – jpp Feb 05 '18 at 13:56
@Aaraeus, at the same time please provide a sample of the data so we can test. – jpp Feb 05 '18 at 14:01
Just did both. Thanks in advance! – Aaraeus Feb 05 '18 at 14:55

Using pandas to read certain columns from an xlsx, with a condition from another row

1 Answers1