14

What is the best way to read from a csv, but only one specific column, like title?

ID | date|  title |
-------------------
  1|  2013|   abc |
  2|  2012|   cde |

The column should then be stored in an array like this:

data = ["abc", "cde"]

This is what I have so far, with pandas:

data = pd.read_csv("data.csv", index_col=2)

I've looked into this thread. I still get an IndexError: list index out of range.

EDIT:

It's not a table, it's comma seperated like this:

ID,date,title
1,2013,abc
2,2012,cde
Community
  • 1
  • 1
dh762
  • 2,259
  • 4
  • 25
  • 44
  • Is that really what your file looks like, with `|` and `--` and so on? (Not really CSV, comma-separated-values, but we should be able to read it with some work.) – DSM Jan 11 '14 at 18:09
  • And are you certain that *all* rows have 3 columns? Sounds as if some of your rows are missing that third column, perhaps more. – Martijn Pieters Jan 11 '14 at 18:09

3 Answers3

24

One option is just to read in the entire csv, then select a column:

data = pd.read_csv("data.csv")

data['title']  # as a Series
data['title'].values  # as a numpy array

As @dawg suggests, you can use the usecols argument, if you also use the squeeze argument to avoid some hackery flattening the values array...

In [11]: titles = pd.read_csv("data.csv", sep=',', usecols=['title'], squeeze=True)

In [12]: titles  # Series
Out[12]: 
0    abc
1    cde
Name: title, dtype: object

In [13]: titles.values  # numpy array
Out[13]: array(['abc', 'cde'], dtype=object)
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
6

You can do something like this:

>>> import pandas as pd
>>> from StringIO import StringIO
>>> txt='''\
... ID,date,title
... 1,2013,abc
... 2,2012,cde'''
>>> data=pd.read_csv(StringIO(txt), usecols=['title']).T.values.tolist()[0]
>>> data
['abc', 'cde']

Or, assuming that you have some blanks:

txt='''\
ID,date,title
1,2013,abc
2,2012,cde
3,2014, 
4,2015,fgh'''
table=pd.read_csv(StringIO(txt), usecols=['title'])
print table
  title
0   abc
1   cde
2      
3   fgh
data=pd.read_csv(StringIO(txt), usecols=['title']).T.values.tolist()[0]
print data
['abc', 'cde', ' ', 'fgh']

Or if you have variable number of data fields:

txt='''\
ID,date,title
1,2013,
2,2012,cde
3
4,2015,fgh'''

print pd.read_csv(StringIO(txt), usecols=['title'])
  title
0   NaN
1   cde
2   NaN
3   fgh

print pd.read_csv(StringIO(txt), usecols=['title']).T.values.tolist()[0]
[nan, 'cde', nan, 'fgh']
dawg
  • 98,345
  • 23
  • 131
  • 206
  • I think you assume I have sort of a table, but it is a comma-seperated csv, just presented in the wrong way (Excel-View). – dh762 Jan 11 '14 at 18:27
  • You can use .ravel() or .flatten() on the data to get a flattened numpy array. – Andy Hayden Jan 11 '14 at 19:31
5

Finally, it was much simpler:

import pandas as pd
data = pd.read_csv("mycsv.csv")
data.columns = ["ID", "date", "title"]
rawlist = list(data.title)
dh762
  • 2,259
  • 4
  • 25
  • 44