16

I have a very large CSV file with 100 columns. To illustrate my problem, I will use a very basic example.

Let's suppose that we have a CSV file.

in  value   d     f
0    975   f01    5
1    976   F      4
2    977   d4     1
3    978   B6     0
4    979   2C     0

I want to select specific columns.

import pandas
data = pandas.read_csv("ThisFile.csv")

In order to select the first 2 columns I used

data.ix[:,:2]

What should I do to select non-adjacent columns, such as the 2nd and the 4th?

I could solve this by rewriting the CSV file, but it is a huge file, so I would like to avoid that.

Racil Hilan
user3378649

3 Answers

21

This selects the second and fourth columns (since Python uses 0-based indexing):

In [272]: df.iloc[:,(1,3)]
Out[272]: 
   value  f
0    975  5
1    976  4
2    977  1
3    978  0
4    979  0

[5 rows x 2 columns]

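df.ix can select by location or label, while df.iloc always selects by location. When indexing by location, use df.iloc to signal your intention more explicitly. It is also a bit faster, since pandas does not have to check whether your index is using labels. Note that df.ix was deprecated in pandas 0.20.0 and removed in pandas 1.0, so on current versions use df.iloc (by position) or df.loc (by label).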
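As a minimal sketch of the difference between positional and label-based selection, the snippet below builds the example table from an in-memory string instead of ThisFile.csv. The comma-separated layout and the name csv_text are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for ThisFile.csv; a comma delimiter is assumed.
csv_text = """in,value,d,f
0,975,f01,5
1,976,F,4
2,977,d4,1
3,978,B6,0
4,979,2C,0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Select the second and fourth columns by position (0-based).
by_position = df.iloc[:, [1, 3]]

# Select the same two columns by label.
by_label = df.loc[:, ["value", "f"]]

# Both selections contain the same columns and values.
print(by_position.equals(by_label))
```

Passing a list of positions to iloc, or a list of names to loc, works in both old and current pandas versions.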


Another possibility is to use the usecols parameter:

data = pandas.read_csv("ThisFile.csv", usecols=[1,3])

This will load only the second and fourth columns into the data DataFrame.
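For a very wide file, usecols is probably the better fit, because the unwanted columns are never loaded into the DataFrame. The sketch below shows that usecols accepts either positions or column names. The in-memory csv_text and its comma delimiter are assumptions standing in for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for ThisFile.csv; a comma delimiter is assumed.
csv_text = """in,value,d,f
0,975,f01,5
1,976,F,4
2,977,d4,1
3,978,B6,0
4,979,2C,0
"""

# Load only the second and fourth columns, given by position.
by_index = pd.read_csv(io.StringIO(csv_text), usecols=[1, 3])

# Load the same columns, given by name.
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["value", "f"])

print(by_index.columns.tolist())
print(by_index.equals(by_name))
```

The order of the entries in usecols is ignored; the columns come back in the order they appear in the file, so reorder them afterwards with data[['f', 'value']] if needed.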

unutbu
  • Thanks! One last thing: when trying iloc, I got this error: "IndexError: too many indices" – user3378649 Mar 14 '14 at 02:11
  • You might have gotten that error, "Too many *indexers*", if the parentheses were omitted, as in `df.iloc[:,1,3]`. – unutbu Mar 14 '14 at 09:12
10

If you would rather select columns by name, you can use

data[['value','f']]

   value  f
0    975  5
1    976  4
2    977  1
3    978  0
4    979  0
Wai Yip Tung
1

As Wai Yip Tung said, you can select columns by name. You can do this right after reading the file, in a single expression (note that the whole file is still read before the columns are selected), for example:

import pandas as pd
data = pd.read_csv("ThisFile.csv")[['value','d']]

This solved my problem.

dasilvadaniel