21

I know beforehand what columns I don't need from an excel file and I'd like to avoid them when reading the file to improve the performance. Something like this:

import pandas as pd
df = pd.read_excel('large_excel_file.xlsx', skip_cols=['col_a', 'col_b',...,'col_zz'])

There is nothing related to this in the documentation. Is there any workaround for this?

cottontail
  • 10,268
  • 18
  • 50
  • 51
Juan David
  • 2,676
  • 4
  • 32
  • 42

3 Answers3

26

If your version of pandas allows (check first if you can pass a function to usecols), I would try something like:

import pandas as pd
df = pd.read_excel('large_excel_file.xlsx', usecols=lambda x: 'Unnamed' not in x,)

This should skip all columns without header names. You could substitute 'Unnamed' with a list of column names you do not want.

MarMat
  • 790
  • 8
  • 12
16

You can use the following technique. Let the columns we don't want(want to skip) are 2 5 8, then find all reamining columns we DO WANT TO KEEP as cols such that:

In [7]: cols2skip = [2,5,8]  
In [8]: cols = [i for i in range(10) if i not in cols2skip]

In [9]: cols
Out[9]: [0, 1, 3, 4, 6, 7, 9]

and then we can use those remaining columns(which we DO WANT TO KEEP) using usecols:

df = pd.read_excel(filename, usecols=cols)
Anurag Dhadse
  • 1,722
  • 1
  • 13
  • 26
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • 1
    I think this is more 'Pythonic' than @MarMat, as this uses readable list comprehension in 2 lines, and other uses lambda. My understanding is always avoid lambda in Python if you can use a list comprehension, and lambda is rarely much faster. If you want someone else to understand your code quicker, this will be easier imho. If you are processing Excel and you find one of columns is binary image string (I get that surprisingly often), this is quite useful! – Will Croxford Feb 15 '21 at 17:19
0

If you're using read_excel, it's certainly easier to use a function to filter columns as in MarMat's answer. Their answer should be changed a bit to fit the case in the OP where skip_cols should be left out.

skip_cols = ['col_a', 'col_b',...,'col_zz']
df = pd.read_excel('Book1.xlsx', usecols=lambda x: x not in skip_cols)

You can also read only the columns of an Excel file by passing nrows=0 and use Index.difference to extract the columns you need.

cols = pd.read_excel('Book1.xlsx', nrows=0).columns
df = pd.read_excel('Book1.xlsx', usecols=cols.difference(skip_cols))
cottontail
  • 10,268
  • 18
  • 50
  • 51