Pandas ExcelFile.parse() reading file in as dict instead of dataframe

Question

I am new to python and even newer to pandas, but relatively well versed in R. I am using Anaconda, with Python 3.5 and pandas 0.18.1. I am trying to read in an excel file as a dataframe. The file admittedly is pretty... ugly. There is a lot of empty space, missing headers, etc. (I am not sure if this is the source of any issues)

I create the file object, then find the appropriate sheet, then try to read that sheet as a dataframe:

xl = pd.ExcelFile(allFiles[i])
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
df = xl.parse(sName)
df

Results:

{'Security exposure - 21 day lag':          Percent of Total Holdings  \
 0      KMNFC vs. 3 Month LIBOR AUD   
 1                        04-OCT-16   
 2                Australian Dollar   
 3                              NaN   
 4                              NaN   
 5                              NaN   
 6                              NaN   
 7                              NaN   
 8          Long/Short Net Exposure   
 9                            Total   
 10                             NaN   
 11                            Long   
 12                             NaN   
 13                             NaN   
 14                             NaN   
 15                             NaN   
 16                             NaN   
 17                             NaN

(This goes on for 20-30 more rows and 5-6 more columns)

I am using Anaconda, and Spyder, which has a 'Variable Explorer'. It shows the variable df to be a dict of the DataFrame type:

However, I cannot use iloc:

df.iloc[:,1]

Traceback (most recent call last):

  File "<ipython-input-77-d7b3e16ccc56>", line 1, in <module>
    df.iloc[:,1]

AttributeError: 'dict' object has no attribute 'iloc'

Any thoughts? What am I missing?

EDIT:

To be clear, what I am really trying to do is reference the first column of the df. In R this would be df[,1]. Looking around it seems to be not a very popular way to do things, or not the 'correct' way. I understand why indexing by column names, or keys, is better, but in this situation, I really just need to index the dataframes by column numbers. Any working method of doing that would be greatly appreciated.

EDIT (2):

Per a suggestion, I tried 'read_excel', with the same results:

df = pd.ExcelFile(allFiles[i]).parse(sName)
df.loc[1]
Traceback (most recent call last):

  File "<ipython-input-90-fc40aa59bd20>", line 2, in <module>
    df.loc[1]

AttributeError: 'dict' object has no attribute 'loc'


df = pd.read_excel(allFiles[i], sheetname = sName)
df.loc[1]
Traceback (most recent call last):

  File "<ipython-input-91-72b8405c6c42>", line 2, in <module>
    df.loc[1]

AttributeError: 'dict' object has no attribute 'loc'

Dictionaries and dataframes are different. What you have is a dictionary of dataframe(s). When you parse the excel file like that, it creates a dictionary where the keys are the sheet names and the values are the dataframes. You can access the keys via `df.keys()` and then access the dataframe via `df['Security exposure - 21 day lag']` (if that is the key). Any reason you are not using pd.read_excel? — ayhan, Oct 27 '16 at 15:57
The names of the sheets are not constant. I needed to use ExcelFile.sheet_names to test them and find the correct one. I (presumably) could use either 'parse' or 'read_excel', but using 'read_excel' does not seem to change anything. — lukehawk, Oct 27 '16 at 16:00
The cause of the error is that you are trying those methods on a dictionary. First, get the dataframe and then call that method (i.e. `df[sName].iloc[:, 1]`) — ayhan, Oct 27 '16 at 16:06
I think I did that, and it did not work. I understand df is a dict. Why? I was lead to believe 'parse' and 'read_excel' would return dataframes. What is the point of passing 'read_excel' a sheet name, if I still have to use the sheet name to reference that df? I am not at all interested in keeping the rest of the sheets. — lukehawk, Oct 27 '16 at 16:10
sName is a list of one string. Ugh. Goddamnit. That's the problem. I changed sName to be a string andfit worked. Thanks for your help! — lukehawk, Oct 27 '16 at 16:11
That's the reason it is returning a dictionary. It creates a mapping of {sheetname: dataframe} You can pass `sName[0]` instead. It will return a dataframe then (at least with pd.read_excel). — ayhan, Oct 27 '16 at 16:12

score 1 · Answer 1 · edited May 23 '17 at 10:34

1

The problem was here:

sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]

which returned a single element list. I changed it to the following:

sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()][0]

which returns a string, and the code then performs as expected.

All thanks to ayhan for pointing this out.

edited May 23 '17 at 10:34

Community

1
1

answered Oct 27 '16 at 16:15

lukehawk

1,423
3
22
48

Pandas ExcelFile.parse() reading file in as dict instead of dataframe

1 Answers1

Linked