Skip specific set of columns when reading excel frame - pandas

Question

I know beforehand what columns I don't need from an excel file and I'd like to avoid them when reading the file to improve the performance. Something like this:

import pandas as pd
df = pd.read_excel('large_excel_file.xlsx', skip_cols=['col_a', 'col_b',...,'col_zz'])

There is nothing related to this in the documentation. Is there any workaround for this?

@Aran-Fey It is possible but the list of columns to use would be significantly large compared with the unused columns list (160 vs 30) — Juan David, Apr 05 '18 at 16:37
do you know the indices (positions) of columns that you want to skip? — MaxU - stand with Ukraine, Apr 05 '18 at 16:44
@MaxU Yes. I can determine the indices of the columns to skip — Juan David, Apr 05 '18 at 17:00
duplicate of https://stackoverflow.com/questions/24366449/python-pandas-how-to-skip-columns-when-reading-a-file — MarMat, May 22 '19 at 06:55

score 26 · Answer 1 · answered May 22 '19 at 08:28

26

If your version of pandas allows (check first if you can pass a function to usecols), I would try something like:

import pandas as pd
df = pd.read_excel('large_excel_file.xlsx', usecols=lambda x: 'Unnamed' not in x,)

This should skip all columns without header names. You could substitute 'Unnamed' with a list of column names you do not want.

answered May 22 '19 at 08:28

MarMat

790
8
12

2

Note that `usecols` accept the columns letters as parameter: usecols = "A,C:AA" – neves Jun 11 '20 at 00:16
Note also that `usecols` can't be used when a multi-index header is specified (pandas v1.4.2) – philipnye Jun 09 '22 at 04:14

score 16 · Accepted Answer · edited Aug 09 '20 at 09:00

16

You can use the following technique. Let the columns we don't want(want to skip) are 2 5 8, then find all reamining columns we DO WANT TO KEEP as cols such that:

In [7]: cols2skip = [2,5,8]  
In [8]: cols = [i for i in range(10) if i not in cols2skip]

In [9]: cols
Out[9]: [0, 1, 3, 4, 6, 7, 9]

and then we can use those remaining columns(which we DO WANT TO KEEP) using usecols:

df = pd.read_excel(filename, usecols=cols)

edited Aug 09 '20 at 09:00

Anurag Dhadse

1,722
1
13
26

answered Apr 05 '18 at 17:14

MaxU - stand with Ukraine

205,989
36
386
419

1

I think this is more 'Pythonic' than @MarMat, as this uses readable list comprehension in 2 lines, and other uses lambda. My understanding is always avoid lambda in Python if you can use a list comprehension, and lambda is rarely much faster. If you want someone else to understand your code quicker, this will be easier imho. If you are processing Excel and you find one of columns is binary image string (I get that surprisingly often), this is quite useful! – Will Croxford Feb 15 '21 at 17:19

score 0 · Answer 3 · answered Feb 18 '23 at 07:51

If you're using read_excel, it's certainly easier to use a function to filter columns as in MarMat's answer. Their answer should be changed a bit to fit the case in the OP where skip_cols should be left out.

skip_cols = ['col_a', 'col_b',...,'col_zz']
df = pd.read_excel('Book1.xlsx', usecols=lambda x: x not in skip_cols)

You can also read only the columns of an Excel file by passing nrows=0 and use Index.difference to extract the columns you need.

cols = pd.read_excel('Book1.xlsx', nrows=0).columns
df = pd.read_excel('Book1.xlsx', usecols=cols.difference(skip_cols))

Skip specific set of columns when reading excel frame - pandas

3 Answers3

Linked