67

I am reading from an Excel sheet and I want to read certain columns: column 0 because it is the row-index, and columns 22:37. Now here is what I do:

import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = 37)
df= pd.concat([df[df.columns[0]], df[df.columns[22:]]], axis=1)

But I would hope there is better way to do that! I know if I do parse_cols=[0, 22,..,37] I can do it, but for large datasets this doesn't make sense.

I also did this:

s = pd.Series(0)
s[1]=22
for i in range(2,14):
    s[i]=s[i-1]+1
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = s)

But it reads the first 15 columns which is the length of s.

MartyIX
  • 27,828
  • 29
  • 136
  • 207
Ana
  • 1,516
  • 3
  • 15
  • 26
  • 1
    you'd have to generate a list of cols and pass this to `parse_cols` e.g. `parse_cols=[0, 22,23,24.....,37]` rather than what you're doing now – EdChum Nov 11 '15 at 16:30
  • Not sure why that didn't work, it could be a bug, what happens when you pass a hard coded list: `df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = [0,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37)`? – EdChum Nov 13 '15 at 09:31
  • @EdChum, It works if pass a hard coded list. – Ana Nov 13 '15 at 17:34
  • In that case generate a list rather than a series – EdChum Nov 13 '15 at 17:44

6 Answers6

92

You can use column indices (letters) like this:

import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols="A,C:AA")
print(df)

Corresponding documentation:

usecols : int, str, list-like, or callable default None

  • If None, then parse all columns.

  • If str, then indicates comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.

  • If list of int, then indicates list of column numbers to be parsed.

  • If list of string, then indicates list of column names to be parsed.

    New in version 0.24.0.

  • If callable, then evaluate each column name against it and parse the column if the callable returns True.

Returns a subset of the columns according to behavior above.

New in version 0.24.0.

tdy
  • 36,675
  • 19
  • 86
  • 83
MartyIX
  • 27,828
  • 29
  • 136
  • 207
  • 12
    It should be noted that "names" should be read as "names in excel", not those you could choose or use as headers. The docs are not clear about this but it is worth mentionning, it gave me some headaches. – Ando Jurai Jun 02 '17 at 12:16
22

parse_cols is deprecated, use usecols instead

that is:

df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols = "A,C:AA")
Georgy
  • 12,464
  • 7
  • 65
  • 73
Leoli
  • 719
  • 1
  • 9
  • 18
  • Note also this bug/unexpected behavior, which I ran into today. https://github.com/pandas-dev/pandas/issues/18273 Looks like using column _names_ does not work with Excel... – Evan Nov 09 '18 at 18:35
19

"usecols" should help, use range of columns (as per excel worksheet, A,B...etc.) below are the examples

1. Selected Columns

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A,C,F")

2. Range of Columns and selected column

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:F,H")

3. Multiple Ranges

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:F,H,J:N")

4. Range of columns

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:N")
Uday Kiran
  • 659
  • 8
  • 8
  • 1
    any ideas for limiting columns by number? – rluts Jun 25 '20 at 09:54
  • 1
    @rluts, replace `usecols="A,C,F"`, with `usecols=[0,2,5]`, incase of range of column numbers, `usecols=range(2,9)`, depending on the requirement please replace the numbers. – Uday Kiran Jun 27 '20 at 12:39
8

If you know the names of the columns and do not want to use A,B,D or 0,4,7. This actually works

df = pd.read_excel(url)[['name of column','name of column','name of column','name of column','name of column']]

where "name of column" = columns wanted. Case and whitespace sensitive

user2557522
  • 131
  • 1
  • 6
2

Read any column's data in excel

import pandas as pd


name_of_file =  "test.xlsx"
data = pd.read_excel(name_of_file)

required_colum_name = "Post test Number"
print(data[required_colum_name])
Mounesh
  • 561
  • 5
  • 18
0

Unfortunately these methods still seem to read and convert the headers before returning the subselection. I have an Excel sheet with duplicate header names because the sheet contains several similar tables. I want to read those tables individually, so I would want to apply usecols. However, this still add suffixes to the duplicate column names.

To reproduce:

  • create an Excel sheet with headers named Header1, Header2, Header1, Header2 under columns A, B, C, D
  • df.read_excel(filename, usecols='C:D')

df.columns will return ['Header1.1', 'Header2.1']

Is there way to circumvent this, aside from splitting and joining the resulting headers? Especially when it is unknown whether there are duplicate columns it is tricky to rename them, as splitting on '.' may be corrupting a non-duplicate header.

Edit: additionally, the length (in indeces) of a DataFrame based on a subset of columns will be determined by the length of the full file. So if column A has 10 rows, and column B only has 5, a DataFrame generated by usecols='B' will have 10 rows of which 5 filled with NaN's.

StephanT
  • 649
  • 5
  • 12