I have a tab-separated file with a column that should be interpreted as a string, but many of the entries are integers. With small files, `read_csv` correctly interprets the column as a string after seeing some non-integer values, but with larger files this doesn't work:

```python
import pandas as pd
df = pd.DataFrame({'a': ['1']*100000 + ['X']*100000 + ['1']*100000, 'b': ['b']*300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')
print df2['a'].unique()
for a in df2['a'][262140:262150]:
    print repr(a)
```

output:

```
['1' 'X' 1]
'1'
'1'
'1'
'1'
1
1
1
1
1
1
```

Interestingly, 262144 is a power of 2, so I think type inference and conversion are happening in chunks, but skipping some of them.

I am fairly certain this is a bug, but would like a workaround, perhaps one that uses quoting (adding `quoting=csv.QUOTE_NONNUMERIC` for both reading and writing does not fix the problem). Ideally I could work around this by quoting my string data and somehow forcing pandas not to do any inference on quoted data.

Using pandas 0.12.0

andrew
    The [docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html) make it look like this would work: `pd.read_csv('test', sep='\t', converters={'a':str})`. – Steven Rumbalski Aug 27 '13 at 17:54
  • @StevenRumbalski and it totally does! You should add this as an answer! – Andy Hayden Aug 27 '13 at 17:58

2 Answers


To avoid having pandas infer your data type, provide a `converters` argument to `read_csv`:

> converters : dict, optional
>
> Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

For your file this would look like:

```python
df2 = pd.read_csv('test', sep='\t', converters={'a': str})
```

My reading of the docs is that you do not need to specify converters for every column. Pandas should continue to infer the datatype of unspecified columns.
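An aside not in the original answer: recent pandas versions also accept a `dtype` argument to `read_csv`, which similarly tells the parser up front not to infer the named column (this is a sketch against the question's own test file, not something the answer verified on pandas 0.12):

```python
import pandas as pd

# recreate the file from the question
df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})
df.to_csv('test', sep='\t', index=False)

# dtype={'a': str} disables inference for column 'a' only;
# other columns are still inferred as usual
df2 = pd.read_csv('test', sep='\t', dtype={'a': str})
```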

Steven Rumbalski

You've tricked the `read_csv` parser here (and, to be fair, I don't think it can always be expected to output correctly no matter what you throw at it)... but yes, it could be a bug!

As @Steven points out, you can use the `converters` argument of `read_csv`:

```python
df2 = pd.read_csv('test', sep='\t', converters={'a': str})
```

A lazy solution is just to patch this up after you've read in the file:

```python
In [11]: df2['a'] = df2['a'].astype('str')

# now they are equal
In [12]: pd.util.testing.assert_frame_equal(df, df2)
```
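Another aside, not part of the original answer: the chunked inference the question observes is controlled by `read_csv`'s `low_memory` flag. Passing `low_memory=False` makes the parser infer each column's type from the whole file at once rather than chunk by chunk (at the cost of holding the whole file in memory), which is a minimal sketch of one more workaround:

```python
import pandas as pd

# recreate the file from the question
df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})
df.to_csv('test', sep='\t', index=False)

# low_memory=False: whole-file type inference, so column 'a' comes back
# as a single (string) type instead of a per-chunk mix
df2 = pd.read_csv('test', sep='\t', low_memory=False)
```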

Note: If you are looking for a solution to store DataFrames, e.g. between sessions, both `pickle` and `HDFStore` are excellent solutions which won't be affected by this type of parsing bug (and will be considerably faster). See: How to store data frame using PANDAS, Python
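For illustration, a minimal sketch of that round trip using `to_pickle` / `read_pickle` (pickle serializes the dtypes themselves, so nothing is re-parsed from text on the way back):

```python
import pandas as pd

df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})

# pickle preserves dtypes exactly; no text parsing, no inference
df.to_pickle('test.pkl')
df2 = pd.read_pickle('test.pkl')
```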

Andy Hayden
  • this is my fallback but involves extra lines of code every time I read a file which I was trying to avoid – andrew Aug 27 '13 at 17:44
  • solution is probably not to use to_csv/read_csv to store your DataFrames; to_pickle or HDFStore are much better solutions (and neither will be affected by this kind of parsing bug). – Andy Hayden Aug 27 '13 at 17:45
  • the longer term solution is to migrate to HDF5Store as you say – andrew Aug 27 '13 at 18:10