127

I have a CSV file that isn't coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.

import pandas as pd
csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

f = open('foo.csv', 'w')
f.write(csv)
f.close()

df1 = pd.read_csv('foo.csv',
        header=0,
        names=["dummy", "date", "loc", "x"], 
        index_col=["date", "loc"], 
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"])
print df1

# Ignore the dummy columns
df2 = pd.read_csv('foo.csv', 
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"], # <----------- Changed
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
print df2

I expect df1 and df2 to be the same except for the missing dummy column, but instead the columns come in mislabeled. Also, the date isn't getting parsed as a date.

In [118]: %run test.py
               dummy  x
date       loc
2009-01-01 a     bar  1
2009-01-02 a     bar  3
2009-01-03 a     bar  5
2009-01-01 b     bar  1
2009-01-02 b     bar  3
2009-01-03 b     bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103

Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.

edit: fixed bad header usage.

chip
  • Something else: your usage of the `header` and `names` keywords is not correct (that's why the first row is missing in your example). `header` expects an int (default 0) for the row to use as the header. Because you passed `True`, which is interpreted as 1, the second row (the first data row) was used as the header and went missing. The column names are still correct because you overwrite them with the `names` argument, but you could leave both out and the first row would be used for the column names by default. This does not solve your initial question, however. – joris Feb 22 '13 at 08:47
  • This looks like a `usecols` bug. Possibly related to [bug 2654](https://github.com/pydata/pandas/issues/2654)? – abudis Feb 22 '13 at 11:37
  • bug is still there without names and header arguments, good find. – Andy Hayden Feb 22 '13 at 12:20
  • @andy I'll poke at it a little more and submit it to the pandas bugs. I appreciate the sanity check. – chip Feb 22 '13 at 17:37

4 Answers

149

The solution lies in understanding these two keyword arguments:

  • names is only necessary when there is no header row in your file and you want to specify other arguments (such as usecols) using column names rather than integer indices.
  • usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.

So because you have a header row, passing header=0 is sufficient; additionally passing names appears to confuse pd.read_csv.

Removing names from the second call gives the desired output:

import pandas as pd
from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        header=0,
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"],
        parse_dates=["date"])

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
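As a quick check that the resulting MultiIndex behaves as expected, cross-section selection works on the result. This is a sketch in Python 3 / current pandas syntax (io.StringIO in place of the old StringIO module), reusing the question's data:

```python
import io
import pandas as pd

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(io.StringIO(csv),
        header=0,
        index_col=["date", "loc"],
        usecols=["date", "loc", "x"],
        parse_dates=["date"])

# cross-section: all rows where the `loc` index level equals "a"
sub = df.xs("a", level="loc")
```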
Mack
  • This is the textbook solution for parsing the CSV data, but at the time I was intent on using the *names* argument because the real data had no header. – chip Jan 06 '15 at 17:11
  • In that case, you wouldn't specify `header=0`. You'd want to use `header=None` and then use `names` in addition. – Mack Jan 06 '15 at 18:32
  • But still use `usecols` with integer indexes for the columns one wants to keep @Mack? – Mr_and_Mrs_D Dec 03 '18 at 05:23
  • @Mr_and_Mrs_D in the case we're discussing, you'd only need to use integer indices if you don't pass `names`. You can use indices in `usecols` either way, but if you're going to pass `names` anyways (which you'd still need to do if you want to specify `parse_dates` and `index_col` using those names), you may as well use the names in `usecols`, since you're less likely to get them wrong. – Mack Jan 04 '21 at 16:18
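Following up on the comment thread: a minimal sketch of the no-header case being discussed, where `header=None` and `names` are combined and `usecols`, `index_col`, and `parse_dates` all refer to those supplied names. Written in Python 3 / current pandas syntax; the column names mirror the question's:

```python
import io
import pandas as pd

# Same data as the question, but with no header row in the file
csv = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(io.StringIO(csv),
        header=None,                          # no header row in the data
        names=["dummy", "date", "loc", "x"],  # so supply the names here
        usecols=["date", "loc", "x"],         # and filter by those names
        index_col=["date", "loc"],
        parse_dates=["date"])
```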
25

This code achieves what you want, though the behavior is weird and certainly looks like a bug:

I observed that it works when:

a) you specify index_col relative to the number of columns you actually use -- so it's three columns in this example, not four (you drop dummy and start counting from there)

b) same for parse_dates

c) not so for usecols ;) for obvious reasons

d) here I adapted the names to mirror this behaviour

import pandas as pd
from StringIO import StringIO

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

df = pd.read_csv(StringIO(csv),
        index_col=[0,1],
        usecols=[1,2,3], 
        parse_dates=[0],
        header=0,
        names=["date", "loc", "", "x"])

print df

which prints

                x
date       loc   
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
tzelleke
  • Thanks. I never figured out the right combination of rearranging the ``names`` and numbers based on ``usecols`` so the data came in correctly. – chip Feb 22 '13 at 23:17
9

If your csv file contains extra data, columns can be deleted from the DataFrame after import.

import pandas as pd
from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        index_col=["date", "loc"], 
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
del df['dummy']

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
chip
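For completeness, the same cleanup can be done without mutating the DataFrame: `DataFrame.drop` returns a new object rather than deleting in place. A sketch in Python 3 / current pandas syntax (io.StringIO in place of the old StringIO module):

```python
import io
import pandas as pd

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(io.StringIO(csv),
        index_col=["date", "loc"],
        parse_dates=["date"],
        header=0)

# drop() returns a new DataFrame instead of mutating in place
df = df.drop(columns=["dummy"])
```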
3

Just add the `index_col=False` parameter:

df1 = pd.read_csv('foo.csv',
        header=0,
        index_col=False,
        names=["dummy", "date", "loc", "x"],
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"])
print df1
Michael Millar