2

It's like how to read certain columns from Excel using Pandas - Python but a little bit more complicated.

Say I have an Excel file called "foo.xlsx" and it grows over time - a new column will be appended on the right every month. However, when I read it, I need only the first two and the last columns. I expected usecols parameter can solve this problem so I went df = pd.read_excel("foo.xlsx", usecols=[0, 1, -1]) but it gives me only the first two columns.

My workaround turns out to be:

df = pd.read_excel("foo.xlsx")
df = df[df.columns[[0, 1, -1]]]

But it needs reading the whole file every time. Is there any way that I can get my desired data frame while reading the file? Thanks.

ytu
  • 1,822
  • 3
  • 19
  • 42
  • @Nihal Could you please elaborate your points? `pandas` does support reading the file with specified columns, but I am asking how to always get the last one. – ytu Feb 21 '19 at 07:33
  • 1
    I suppose specifying the `usecols` argument won't save a lot of time because `read_excel` always reads the whole sheet. `usecols` just skips the rest after heaving read everything and will only accelerate parsing . – JoergVanAken Feb 21 '19 at 08:04

3 Answers3

1

One idea is get column count and pass to usecols:

from openpyxl import load_workbook

path = "file.xlsx"

wb = load_workbook(path)
sheet = wb.worksheets[0]
column_count = sheet.max_column
print (column_count)

Or read only first row of file:

column_count  = len(pd.read_excel(path, nrows=0).columns)

df = pd.read_excel(path, usecols=[0, 1, column_count-1])
print (df)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

If you really want to do this (see my comment above) you could to this:

xl = pd.ExcelFile(file)
ncols = xl.book.sheets()[0].ncols
df = xl.parse(0, usecols=[0, 1, ncols-1])

This solution won't read the excel file twice.

JoergVanAken
  • 1,286
  • 9
  • 10
  • Just out of curiosity, doesn't `xl.book.sheets()[0].ncols` need to read the whole Excel file to get the answer? I checked [ExcelFile Vs. read_excel in pandas](https://stackoverflow.com/a/49991054/6666231) and I wonder how this solution can be better. – ytu Feb 21 '19 at 09:55
  • 1
    Yes, it has to read the whole file. But the following parse command doesn't have to id again. So overall not much has been won, but nothing is done twice (in contrast to jezreal's answer) – JoergVanAken Feb 21 '19 at 09:59
-1

You can use df.head() and df.tail() to read the first 2 and last line. For example:

df = pd.read_excel("foo.xlsx", sheet_name='ABC')
#print the first 2 column
print(df.head(2))
#print the last column
print(df.tail(1))

EDIT: Oops the above code reads rows and not columns. Yes, you have to read the file everytime. I don't think there's an option to read partial file.

For reading column maybe you can do something like this

df['Column Name'][index]
Arjun
  • 21
  • 5
  • Your answer provides rows, not columns. Also, it cannot prevent reading the whole file while the file is getting larger over time. – ytu Sep 30 '20 at 08:02