I have an xlsx where the first 9 rows are headers. Row 1 contains a name, like "Bob" and "Alice".
Row 4 contains either 'Monthly' or 'Quarterly'.
Sometimes there are two fields called 'Bob' but one has 'Monthly' and the other has 'Quarterly' in row 4.
I understand I could read in the column called 'Bob' into a dataframe, but is there a way to specify which one should be loaded into the dataframe?
e.g. below I have bob and alice, and as it stands I would read in 2 Bob fields and 2 Alice fields. Is there a way of reducing these somehow on the initial readthrough?
import pandas as pd
fields = ['Bob', 'Alice']
type = ['Monthly','Quarterly']
df = pd.read_excel('data.xlsx', sheet='Sheet1', usecols=fields)
# See the keys
print df.keys()
# See content in 'Bob'
print df.bob
Alternatively, is there a way I can read all 4 columns - Bob and Alice - and then only keep the one I want (e.g. monthly for Bob, quarterly for Alice)?
Example xlsx file is as follows (formatted as a csv to make it look nicer here though):
Mnemonic:,Alice,Bob,Mnemonic:,Alice,Bob
Description:,Test results for Alice,Test results for Bob,Description:,Test results for Alice,Test results for Bob
Source:,(na),(na),Source:,(na),(na)
Native Frequency:,Monthly,Monthly,Native Frequency:,Quarterly,Quarterly
Transformation:,None,None,Transformation:,None,None
Begin Date:,10/31/2006,10/31/2006,Begin Date:,09/30/2006,09/30/2006
Last Updated:,,,Last Updated:,,
Historical End Date:,12/30/2017,12/30/2017,Historical End Date:,12/30/2017,12/30/2017
Geography:,(na),(na),Geography:,(na),(na)
10/31/2006,3,2,09/30/2006,3,2
11/30/2006,3,2,12/31/2006,5,1
12/31/2006,3,2,03/31/2007,7,4
01/31/2007,5,1,06/30/2007,8,7
02/28/2007,5,1,09/30/2007,1,2
03/31/2007,5,1,12/31/2007,6,9
04/30/2007,7,4,03/31/2008,1,5
05/31/2007,7,4,06/30/2008,9,7
06/30/2007,7,4,09/30/2008,9,2
07/31/2007,8,7,12/31/2008,8,7
08/31/2007,8,7,03/31/2009,5,8
09/30/2007,8,7,06/30/2009,3,6