
I have a large gzip file which I would like to import into a pandas dataframe. Unfortunately, the file has an uneven number of columns. The data has roughly this format:

.... Col_20: 25    Col_21: 23432    Col22: 639142
.... Col_20: 25    Col_22: 25134    Col23: 243344
.... Col_21: 75    Col_23: 79876    Col25: 634534    Col22: 5    Col24: 73453
.... Col_20: 25    Col_21: 32425    Col23: 989423
.... Col_20: 25    Col_21: 23424    Col22: 342421    Col23: 7    Col24: 13424    Col 25: 67
.... Col_20: 95    Col_21: 32121    Col25: 111231

As a test, I tried this:

import pandas as pd
filename = 'path/to/filename.gz'

for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python'):
    print(chunk)

Here is the error I get in return:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 795, in __next__
    return self.get_chunk()
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 836, in get_chunk
    return self.read(nrows=size)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 1761, in read
    alldata = self._rows_to_cols(content)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/io/parsers.py", line 2166, in _rows_to_cols
    raise ValueError(msg)
ValueError: Expected 18 fields in line 28, saw 22

How can you allocate a certain number of columns for pandas.read_csv()?

ShanZhengYang
  • Your problem is some malformed CSV; it has nothing to do with pre-allocating the number of columns. You need to do some additional debugging to find the specific file and line that is malformed. You should post a link to the csv or a small sample that reproduces the error – EdChum Sep 08 '16 at 12:58
  • @EdChum It isn't just one line---this file has virtually every row like this. Some rows may have 20 columns, the next 28. What is the correct way to proceed? – ShanZhengYang Sep 08 '16 at 13:00
  • I can't answer hypothetical questions without seeing concrete data. Post your data; it should have regular delimiters and form. If not, then you need to clean the data first – EdChum Sep 08 '16 at 13:01
  • @EdChum Sorry, you're right. The data is tab-delimited text format, but the number of columns per row is inconsistent. It is always tab-delimited text though. The motivation for my question is really, how do I deal with this? Especially for very large data files, e.g. 100s of GB? – ShanZhengYang Sep 08 '16 at 13:10
  • Is it fixed width? The size of the file is irrelevant here; you have an issue with malformed data. How was this file generated? – EdChum Sep 08 '16 at 13:14
  • http://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python – Quazi Marufur Rahman Sep 08 '16 at 13:21
  • @qmaruf What would you recommend if the columns are not in the correct order? – ShanZhengYang Sep 08 '16 at 14:55
  • @ShanZhengYang Why does the order matter? All you have to do is add the correct number of names, that is, at least as many as the record with the most columns in the whole file. Do you know that number? Why are you reading this into data frames anyway? It doesn't seem to be a format which benefits from that type. – BlackJack Sep 08 '16 at 15:10
  • @BlackJack I don't understand. This is a malformed "tsv" file, which has a different number of columns in each row, in a different order. Even if I add the correct number of names, how would these correspond to the correct values? – ShanZhengYang Sep 08 '16 at 16:43
  • @ShanZhengYang The names themselves are completely irrelevant; it's just the number of names, to tell the reader beforehand how many columns there are at maximum, so it doesn't read the first line and assume each line has that number of columns. Do you know how many columns you expect at max in the file? – BlackJack Sep 08 '16 at 16:49
  • @BlackJack "Do you know how many columns you expect at max in the file?" Yes. Though, I'm not quite following your logic yet---do you have examples/code I could see? – ShanZhengYang Sep 08 '16 at 16:54
  • @ShanZhengYang See the accepted answer in the link qmaruf gave in a comment above. – BlackJack Sep 09 '16 at 18:37
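A side note on the format shown in the question: since each row carries `name: value` pairs whose number and order vary, one option (a hedged sketch with made-up sample data, not something proposed in the thread) is to parse each line into a dict keyed by the column name and let `DataFrame` align the columns:

```python
import io

import pandas as pd

# Made-up sample resembling the question's rows: tab-separated
# "name: value" pairs, varying in number and order per line
raw = (
    "Col_20: 25\tCol_21: 23432\tCol_22: 639142\n"
    "Col_21: 75\tCol_23: 79876\tCol_22: 5\n"
)

rows = []
for text_line in io.StringIO(raw):
    pairs = (field.split(":", 1) for field in text_line.rstrip("\n").split("\t"))
    rows.append({k.strip(): int(v) for k, v in pairs})

# DataFrame unions the keys into columns and fills gaps with NaN
df = pd.DataFrame(rows)
```

Because the fields are matched by name, the column order within each line no longer matters.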

1 Answer


You could also try this:

for chunk in pd.read_csv(filename, sep='\t', chunksize=10**5, engine='python', error_bad_lines=False):
    print(chunk)

error_bad_lines=False will skip the bad lines, though. I will see if a better alternative can be found.
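(An aside for readers on newer pandas: `error_bad_lines` was deprecated in pandas 1.3 and later removed in favor of the `on_bad_lines` argument. The equivalent call would look roughly like this, shown here with inline sample data rather than the question's file:)

```python
import io

import pandas as pd

# Inline sample: the second data row has one field too many
raw = "a\tb\n1\t2\n1\t2\t3\n4\t5\n"

# on_bad_lines="skip" (pandas >= 1.3) replaces error_bad_lines=False
df = pd.read_csv(io.StringIO(raw), sep="\t", on_bad_lines="skip")
```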

EDIT: In order to recover the lines that were skipped by error_bad_lines, we can go through the error messages and add them back to the dataframe:

import pandas as pd

line     = []
expected = []
saw      = []
cont     = True

while cont == True:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        # Parse messages of the form "Error tokenizing data. C error:
        # Expected 18 fields in line 28, saw 22"
        errortype = str(e).split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            cerror = str(e).split(':')[1].strip().replace(',', '')
            nums   = [n for n in cerror.split(' ') if str.isdigit(n)]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror = 'Unknown'
            print('Unknown Error - 222')
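As the comments on the question suggest, another route is to tell read_csv up front how many columns to allocate: pass `names` with at least as many entries as the widest row, and shorter rows are padded with NaN. A minimal sketch with made-up data (it takes two passes over the input, so for a very large gzip file you would want to compute the width while streaming):

```python
import io

import pandas as pd

# Made-up ragged, tab-separated data with no header row
raw = "a\t1\t2\na\t1\t2\t3\t4\na\t1\n"

# First pass: find the widest row so every column gets a name
max_cols = max(len(l.split("\t")) for l in raw.splitlines())
names = [f"col_{i}" for i in range(max_cols)]

# Second pass: with enough names, shorter rows are NaN-padded
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=names)
```

Note this only fixes the field count; if the fields can also appear in a different order per row, aligning them needs name-based parsing instead.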
SerialDev