how to read certain columns from Excel using Pandas - Python

Question

I am reading from an Excel sheet and I want to read certain columns: column 0 because it is the row-index, and columns 22:37. Now here is what I do:

import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = 37)
df= pd.concat([df[df.columns[0]], df[df.columns[22:]]], axis=1)

But I would hope there is better way to do that! I know if I do parse_cols=[0, 22,..,37] I can do it, but for large datasets this doesn't make sense.

I also did this:

s = pd.Series(0)
s[1]=22
for i in range(2,14):
    s[i]=s[i-1]+1
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = s)

But it reads the first 15 columns which is the length of s.

you'd have to generate a list of cols and pass this to `parse_cols` e.g. `parse_cols=[0, 22,23,24.....,37]` rather than what you're doing now — EdChum, Nov 11 '15 at 16:30
Not sure why that didn't work, it could be a bug, what happens when you pass a hard coded list: `df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = [0,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37)`? — EdChum, Nov 13 '15 at 09:31

score 92 · Answer 1 · edited Dec 05 '21 at 07:39

You can use column indices (letters) like this:

import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols="A,C:AA")
print(df)

Corresponding documentation:

usecols : int, str, list-like, or callable default None

If None, then parse all columns.

If str, then indicates comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.

If list of int, then indicates list of column numbers to be parsed.

If list of string, then indicates list of column names to be parsed.

New in version 0.24.0.

If callable, then evaluate each column name against it and parse the column if the callable returns True.

Returns a subset of the columns according to behavior above.

New in version 0.24.0.

It should be noted that "names" should be read as "names in excel", not those you could choose or use as headers. The docs are not clear about this but it is worth mentionning, it gave me some headaches. — Ando Jurai, Jun 02 '17 at 12:16

score 22 · Answer 2 · edited Feb 15 '19 at 09:43

22

parse_cols is deprecated, use usecols instead

that is:

df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols = "A,C:AA")

edited Feb 15 '19 at 09:43

Georgy

12,464
7
65
73

answered Mar 23 '18 at 04:57

Leoli

719
1
9
18

Note also this bug/unexpected behavior, which I ran into today. https://github.com/pandas-dev/pandas/issues/18273 Looks like using column _names_ does not work with Excel... – Evan Nov 09 '18 at 18:35

Uday Kiran · Answer 3 · 2021-07-28T20:01:20.037

19

"usecols" should help, use range of columns (as per excel worksheet, A,B...etc.) below are the examples

1. Selected Columns

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A,C,F")

2. Range of Columns and selected column

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:F,H")

3. Multiple Ranges

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:F,H,J:N")

4. Range of columns

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:N")

edited Jul 28 '21 at 20:01

answered Apr 05 '20 at 09:46

Uday Kiran

659
8
8

1

any ideas for limiting columns by number? – rluts Jun 25 '20 at 09:54
1

@rluts, replace `usecols="A,C,F"`, with `usecols=[0,2,5]`, incase of range of column numbers, `usecols=range(2,9)`, depending on the requirement please replace the numbers. – Uday Kiran Jun 27 '20 at 12:39

score 8 · Answer 4 · answered Jun 23 '22 at 20:28

If you know the names of the columns and do not want to use A,B,D or 0,4,7. This actually works

df = pd.read_excel(url)[['name of column','name of column','name of column','name of column','name of column']]

where "name of column" = columns wanted. Case and whitespace sensitive

score 2 · Answer 5 · answered Sep 11 '22 at 12:57

2

Read any column's data in excel

import pandas as pd


name_of_file =  "test.xlsx"
data = pd.read_excel(name_of_file)

required_colum_name = "Post test Number"
print(data[required_colum_name])

answered Sep 11 '22 at 12:57

Mounesh

561
5
18

StephanT · Answer 6 · 2022-12-05T16:15:53.920

Unfortunately these methods still seem to read and convert the headers before returning the subselection. I have an Excel sheet with duplicate header names because the sheet contains several similar tables. I want to read those tables individually, so I would want to apply usecols. However, this still add suffixes to the duplicate column names.

To reproduce:

create an Excel sheet with headers named Header1, Header2, Header1, Header2 under columns A, B, C, D
df.read_excel(filename, usecols='C:D')

df.columns will return ['Header1.1', 'Header2.1']

Is there way to circumvent this, aside from splitting and joining the resulting headers? Especially when it is unknown whether there are duplicate columns it is tricky to rename them, as splitting on '.' may be corrupting a non-duplicate header.

Edit: additionally, the length (in indeces) of a DataFrame based on a subset of columns will be determined by the length of the full file. So if column A has 10 rows, and column B only has 5, a DataFrame generated by usecols='B' will have 10 rows of which 5 filled with NaN's.

how to read certain columns from Excel using Pandas - Python

6 Answers6

Read any column's data in excel

Linked

Related