
I have a 350 MB tab-separated text file. If I try to read it into memory I get an out-of-memory exception, so I am trying something along these lines (i.e. only read in a few columns):

import pandas as pd

input_file_and_path = r'C:\Christian\ModellingData\X.txt'

column_names = [
    'X1'
    # , 'X2'
]
raw_data = pd.DataFrame()
for chunk in pd.read_csv(input_file_and_path, names=column_names, chunksize=1000, sep='\t'):
    raw_data = pd.concat([raw_data, chunk], ignore_index=True)

print(raw_data.head())

Unfortunately, I get this:

Traceback (most recent call last):
  File "pandas\_libs\parsers.pyx", line 1134, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/xxxx/EdaDataPrepRange1.py", line 17, in <module>
    for chunk in pd.read_csv(input_file_and_path, header=None, names=column_names, chunksize=1000, sep='\t'):
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1141, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte

Any ideas? Btw, how can I generally deal with large files and impute, for example, missing values? Ultimately, I need to read in everything to determine, for example, the median to impute.

  • For the encoding error part, I have written an [answer](https://stackoverflow.com/a/51763708/3545273) on another question that deals specifically with UnicodeDecodeErrors in pandas read_csv – Serge Ballesta Aug 09 '18 at 09:47

2 Answers


Use encoding="utf-8" while using pd.read_csv.

Here they have used open(file path, encoding='windows-1252'); see if this works.

Reference: ['utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte](https://stackoverflow.com/questions/48067514/utf-8-codec-cant-decode-byte-0xa0-in-position-4276-invalid-start-byte)

Working Solution

Use encoding="ISO-8859-1".
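
A minimal sketch of that applied to the call from the question (same path, column list and chunk size as in the question; only the encoding argument is added):

import pandas as pd

input_file_and_path = r'C:\Christian\ModellingData\X.txt'
column_names = ['X1']

raw_data = pd.DataFrame()
# Tell the parser the file is ISO-8859-1 instead of UTF-8
for chunk in pd.read_csv(input_file_and_path, names=column_names,
                         chunksize=1000, sep='\t', encoding="ISO-8859-1"):
    raw_data = pd.concat([raw_data, chunk], ignore_index=True)

print(raw_data.head())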

Upasana Mittal
  • Tried: for chunk in pd.read_csv(input_file_and_path, names=column_names, chunksize=1000, sep='\t', encoding="utf-8") – still getting the same error )-: – cs0815 Aug 09 '18 at 10:18
  • @cs0815 This shouldn't be the case though, as utf-8 automatically reads ASCII characters as well. Hope your pandas is up to date. Here they have used this encoding; see if it works: `open(file path, encoding='windows-1252')` https://stackoverflow.com/questions/48067514/utf-8-codec-cant-decode-byte-0xa0-in-position-4276-invalid-start-byte – Upasana Mittal Aug 09 '18 at 10:34
  • Thanks. It should be, as I just installed Anaconda from scratch yesterday. – cs0815 Aug 09 '18 at 10:37
  • encoding="ISO-8859-1" worked for me in the end ... you may want to adapt your answer ... – cs0815 Aug 09 '18 at 13:28
  • Sure. will do. Thanks :) – Upasana Mittal Aug 09 '18 at 13:29

Regarding your large-file problem, just use a file handle and a context manager:

with open("your_file.txt") as fileObject:
    for line in fileObject:
        do_something_with(line)

# No need to close the file: 'with' does that automatically

This won't load the whole file into memory. Instead, it'll load a line at a time, and will 'forget' previous lines unless you store a reference.
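
As a sketch of how the same chunk-at-a-time idea covers the follow-up question (finding, say, a median for imputation without holding the whole table in memory), pandas can do this too. The path, separator, encoding, chunk size and the choice of column 0 below are assumptions taken from the question:

import pandas as pd

# Collect a single column chunk by chunk, then take its median.
# All file-specific details here are placeholders from the question.
parts = []
for chunk in pd.read_csv(r'C:\Christian\ModellingData\X.txt', sep='\t',
                         header=None, usecols=[0], chunksize=100000,
                         encoding="ISO-8859-1"):
    parts.append(chunk[0].dropna())

median = pd.concat(parts).median()
print(median)
# A second pass over the chunks can then fill missing values with this median.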

Also, regarding your encoding problem, just use encoding="utf-8" while using pd.read_csv.

Adi219