I have a tab-separated file with a column that should be interpreted as a string, but many of the entries are integers. With small files, `read_csv` correctly interprets the column as a string after seeing some non-integer values, but with larger files this doesn't work:

```python
import pandas as pd
df = pd.DataFrame({'a': ['1']*100000 + ['X']*100000 + ['1']*100000, 'b': ['b']*300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')
print df2['a'].unique()
for a in df2['a'][262140:262150]:
    print repr(a)
```

output:

```
['1' 'X' 1]
'1'
'1'
'1'
'1'
1
1
1
1
1
1
```

Interestingly, 262144 is a power of 2, so I think type inference and conversion are happening in chunks, but skipping some of them.

I am fairly certain this is a bug, but would like a workaround, perhaps one that uses quoting (adding `quoting=csv.QUOTE_NONNUMERIC` for both reading and writing does not fix the problem). Ideally I could work around this by quoting my string data and somehow forcing pandas not to do any inference on quoted data.

Using pandas 0.12.0

andrew
    The [docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html) make it look like this would work: `pd.read_csv('test', sep='\t', converters={'a':str})`. – Steven Rumbalski Aug 27 '13 at 17:54
  • @StevenRumbalski and it totally does! You should add this as an answer! – Andy Hayden Aug 27 '13 at 17:58

2 Answers


To avoid having pandas infer your data type, provide a `converters` argument to `read_csv`:

> converters : dict, optional
>
> Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

For your file this would look like:

```python
df2 = pd.read_csv('test', sep='\t', converters={'a': str})
```

My reading of the docs is that you do not need to specify converters for every column. Pandas should continue to infer the datatype of unspecified columns.
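An aside not in the original answer: recent pandas versions also accept a `dtype` argument to `read_csv`, which similarly tells the parser up front not to infer the named column (this is a sketch against the question's own test file, not something the answer verified on pandas 0.12):

```python
import pandas as pd

# recreate the file from the question
df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})
df.to_csv('test', sep='\t', index=False)

# dtype={'a': str} disables inference for column 'a' only;
# other columns are still inferred as usual
df2 = pd.read_csv('test', sep='\t', dtype={'a': str})
```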

Steven Rumbalski

You've tricked the `read_csv` parser here (and, to be fair, I don't think it can always be expected to output correctly no matter what you throw at it)... but yes, it could be a bug!

As @Steven points out, you can use the `converters` argument of `read_csv`:

```python
df2 = pd.read_csv('test', sep='\t', converters={'a': str})
```

A lazy solution is just to patch this up after you've read in the file:

```python
In [11]: df2['a'] = df2['a'].astype('str')

# now they are equal
In [12]: pd.util.testing.assert_frame_equal(df, df2)
```
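Another aside, not part of the original answer: the chunked inference the question observes is controlled by `read_csv`'s `low_memory` flag. Passing `low_memory=False` makes the parser infer each column's type from the whole file at once rather than chunk by chunk (at the cost of holding the whole file in memory), which is a minimal sketch of one more workaround:

```python
import pandas as pd

# recreate the file from the question
df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})
df.to_csv('test', sep='\t', index=False)

# low_memory=False: whole-file type inference, so column 'a' comes back
# as a single (string) type instead of a per-chunk mix
df2 = pd.read_csv('test', sep='\t', low_memory=False)
```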

Note: If you are looking for a solution to store DataFrames, e.g. between sessions, both `pickle` and `HDFStore` are excellent solutions which won't be affected by this type of parsing bug (and will be considerably faster). See: How to store data frame using PANDAS, Python
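For illustration, a minimal sketch of that round trip using `to_pickle` / `read_pickle` (pickle serializes the dtypes themselves, so nothing is re-parsed from text on the way back):

```python
import pandas as pd

df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})

# pickle preserves dtypes exactly; no text parsing, no inference
df.to_pickle('test.pkl')
df2 = pd.read_pickle('test.pkl')
```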

Andy Hayden
  • this is my fallback but involves extra lines of code every time I read a file which I was trying to avoid – andrew Aug 27 '13 at 17:44
  • solution is probably not to use to_csv/read_csv to store your DataFrames; to_pickle or HDFStore are much better solutions (and neither will be affected by this kind of parsing bug). – Andy Hayden Aug 27 '13 at 17:45
  • the longer term solution is to migrate to HDF5Store as you say – andrew Aug 27 '13 at 18:10