
I have an Excel file along the lines of:

       gdp     gdp (2009)


1929   104.6   1056.7
1930   173.6   962.0
1931   72.3    846.6

I want to read in the file and specify that the first column (which has no header information) is an integer. I don't need column B.

I am reading in the file using the following

import pandas as pd
from pandas import ExcelFile

gdp = pd.read_excel('gdpfile.xls', skiprows = 2, parse_cols = "A,C")

This reads in fine, except the years all get turned into floats, e.g. 1929.0, 1930.0, 1931.0. The first two rows are NaN.

I want to specify that it should be integer. I have tried adding converters = {"A":int,"C":float} in the read_excel command, as suggested by Python pandas: how to specify data types when reading an Excel file? but this did not fix things.

I have tried to convert after the fact, which I've previously done to convert strings to float, however this also did not work.

gdp.columns = ['Year','GDP 2009']
gdp['Year'] = gdp['Year'].astype(int)

I also tried using dtypes = int as suggested in one of the comments at the above link, however this also does not work.

Note that skiprows is necessary because my actual Excel file has a few rows at the top that I do not want.

Esme_

1 Answer

In the sample given here, two blank rows follow the heading. If you want to keep the heading, you can pass the rows to skip as a list:

pd.read_excel("test.xls",parse_cols="A,C",skiprows=[1,2])

Also, please check whether there are any other NaN cells in that column. If a column contains NaN values, its dtype is promoted to float. See: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
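To illustrate the promotion described above, here is a minimal sketch that does not need the spreadsheet. The `years` Series is a hypothetical stand-in for the first column with one empty cell; it is not read from the asker's file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for column A: three years followed by one empty cell.
years = pd.Series([1929, 1930, 1931, np.nan])

# The missing value forces the whole column to a float dtype.
print(years.dtype)

# A direct cast to int fails while the NaN is still present.
try:
    years.astype(int)
    cast_failed = False
except ValueError as exc:
    cast_failed = True
    print("astype(int) raised:", exc)
```

This is probably why converting after the fact appears not to work: the cast is rejected as long as any NaN remains in the column.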

Also note that because the first column has no heading, pandas takes it as the index on import. To avoid that, I followed the steps below.

My Excel file looks like this:

    NaN       gdp   gdp (2009)
    NaN       NaN   NaN
    NaN       NaN   NaN
    1929    104.6   1056.7
    1930    173.6   962
    1931    72.3    846.6
    NaN     NaN     NaN
    1952    45.3    56.6

I removed the default headers and supplied my own to avoid the indexing issue:

 test = pd.read_excel("test.xls",skiprows=[0,3],header=None,names=['Year','gdp (2009)'],parse_cols="A,C")

As stated above, since the column contains NaN values, its dtype is converted to float. You can drop the NaN rows or fill them with 0 or some other value. In this case I'm dropping the NaN rows.

test = test.dropna(axis=0, how='all')

Once you have removed the NaN values, you can use astype to convert the column to int:

test['Year']=test.Year.astype(int)
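The drop-then-cast steps can be tried without an Excel file. The sketch below builds a hypothetical in-memory copy of the sample data (the values are taken from the table above, the DataFrame construction is an assumption made for illustration) and applies the same two calls.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory copy of the sample data, including one fully empty row.
test = pd.DataFrame({
    "Year": [1929, 1930, 1931, np.nan, 1952],
    "gdp (2009)": [1056.7, 962.0, 846.6, np.nan, 56.6],
})

# Drop rows in which every cell is NaN, then cast the year column to int.
test = test.dropna(axis=0, how="all")
test["Year"] = test["Year"].astype(int)

print(test["Year"].tolist())  # [1929, 1930, 1931, 1952]
```

Note that how='all' only removes rows that are empty in every column. If a row could have a missing year but a value elsewhere, dropna(subset=['Year']) may be the safer choice before the cast.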

Please check if this works for you and let me know if you need more clarification on this. Thanks,

kiranrag
    Other NaNs were the problem: the file was reading empty lines below my data, most likely because other columns were longer. I didn't have the problem with the first column becoming the index. I ended up using ```gdp = pd.read_excel('gdpfile.xls', skiprows = 2, parse_cols = "A,C", header = None).iloc[:15]; gdp.columns = ['Year','GDP 2009']; gdp['Year'] = gdp['Year'].astype(int)``` – Esme_ Jan 18 '18 at 23:04