possible inconsistency in text handling of pandas read_table() function

Question

In a previous post, I found out that the pandas read_table() function can handle variable-length whitespace as a delimiter if you use the read_table('datafile', sep=r'\s*') construction. While this works great for many of my files, it fails on others that are highly similar.

EDIT: The examples I originally posted did not reproduce the problem when others tried them, so I am posting links to the original files for AY907538 and AY942707 and leaving the error message that I cannot resolve.

from pandas import read_table

## filename: AY942707
# this loads with no problem
data = read_table('AY942707.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')

## filename: AY907538
# this one fails
data = read_table('AY907538.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')

which will generate the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-131d10d1fb1d> in <module>()
      2 
      3 #temp = get_dataset('AY907538.hmmdomtblout')
----> 4 data = read_table('AY907538.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')
      5 #data = read_table('AY942707.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in read_table(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze)
    282     kwds['encoding'] = None
    283 
--> 284     return _read(TextParser, filepath_or_buffer, kwds)
    285 
    286 @Appender(_read_fwf_doc)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(cls, filepath_or_buffer, kwds)
    189         return parser
    190 
--> 191     return parser.get_chunk()
    192 
    193 @Appender(_read_csv_doc)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in get_chunk(self, rows)
    779             msg = ('Expecting %d columns, got %d in row %d' %
    780                    (col_len, zip_len, row_num))
--> 781             raise ValueError(msg)
    782 
    783         data = dict((k, v) for k, v in izip(self.columns, zipped_content))

ValueError: Expecting 26 columns, got 28 in row 6

Try to reduce your sample data to something smaller (ideally a single row) that still demonstrates the problem. — BrenBarn, Aug 20 '12 at 02:06
@BrenBarn. Brought it down to just a few entries as I wanted to leave a few for comparison. The bug entry says the sixth row is the issue. Do you think the shorter value - 4.6 - in the i-Evalue could be the issue? — zach, Aug 20 '12 at 02:13
Works for me. So ISTM either something got lost during the posting process or my copying process, or we're using different `pandas` versions. [Mine says 0.8.1, but I can't remember if I'm using a checked-out copy or not.] — DSM, Aug 20 '12 at 02:16
hmmm. Let me check my version and also copy the data I posted to see if I can replicate the error. — zach, Aug 20 '12 at 02:23
@DSM thanks for checking the data. I am 0.8. as well. But when I copy/paste the webpage data I posted back into a text file and read it in, I have no problem either. Do you think this could point to something to do with hard-returns? — zach, Aug 20 '12 at 02:31
added links to the original files so they can be downloaded. — zach, Aug 20 '12 at 02:39
Okay, I can reproduce this. Unfortunately the only way I can get around it is very hacky. Hopefully someone else can come up with something clever (if not, maybe it's worth making a pull request to Wes McKinney.) — DSM, Aug 20 '12 at 03:13
@DSM. Thanks for sticking with this problem. I am happy to bring it to Wes' attention but I am not clear where the problem lies. — zach, Aug 20 '12 at 03:20

score 1 · Accepted Answer · answered Aug 20 '12 at 07:55

In both files the last field, description of target, holds multiple words. Because whitespace is used as the separator, read_table does not treat description of target as a single column; each word in the field lands in its own column. In AY942707 the description on the first line holds more words than the description on any of the other lines; this is not the case in AY907538. read_table determines the number of columns from the first line, and every following line must have the same number of columns or fewer.
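As an illustration of the kind of pre-munging this suggests, here is a minimal sketch that splits each line on whitespace only a fixed number of times, so the trailing free-text description stays in one column. The function name split_rows and the sample lines are hypothetical, and the default of 22 fixed columns before the description is an assumption about the domtblout layout that should be checked against the header lines of your own files.

```python
def split_rows(lines, n_fixed=22):
    """Split whitespace-delimited lines, keeping trailing free text as one field.

    n_fixed is the number of single-token columns that come before the
    free-text description (assumed, not read from the file).
    """
    rows = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment/header lines.
        if not line or line.startswith('#'):
            continue
        # Split at most n_fixed times: the remainder is the description.
        rows.append(line.split(None, n_fixed))
    return rows


# Made-up sample with 3 fixed columns followed by a description.
sample = [
    "# comment line\n",
    "seqA  10  0.5  short description\n",
    "seqB  20  4.6  a much longer free text description\n",
]
rows = split_rows(sample, n_fixed=3)
```

Every row in rows now has the same number of fields regardless of how many words the description contains, so the list can be handed to pandas.DataFrame(rows) without the column-count mismatch.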

Wouter Overmeire

Thanks for looking at this. I thought that finding the answer may prove helpful to pandas but I think in this case a little pre-munging before loading into pandas is the best way to go. Thanks again. – zach Aug 20 '12 at 16:03