
I have a 350 MB tab-separated text file. If I try to read it into memory I get an out-of-memory exception, so I am trying something along these lines (i.e. only read in a few columns):

import pandas as pd

input_file_and_path = r'C:\Christian\ModellingData\X.txt'

column_names = [
    'X1'
    # , 'X2'
]
raw_data = pd.DataFrame()
for chunk in pd.read_csv(input_file_and_path, names=column_names, chunksize=1000, sep='\t'):
    raw_data = pd.concat([raw_data, chunk], ignore_index=True)

print(raw_data.head())

Unfortunately, I get this:

Traceback (most recent call last):
  File "pandas\_libs\parsers.pyx", line 1134, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/xxxx/EdaDataPrepRange1.py", line 17, in <module>
    for chunk in pd.read_csv(input_file_and_path, header=None, names=column_names, chunksize=1000, sep='\t'):
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1141, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte

Any ideas? Btw, how can I generally deal with large files and impute, for example, missing values? Ultimately, I need to read in everything to determine, for example, the median to impute.

  • For the encoding error part, I have written an [answer](https://stackoverflow.com/a/51763708/3545273) on another question that deals specifically with UnicodeDecodeErrors in pandas read_csv – Serge Ballesta Aug 09 '18 at 09:47

2 Answers


Use encoding="utf-8" while using pd.read_csv.

Here they have used open(file path, encoding='windows-1252'); see if this works.

Reference: ['utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte](https://stackoverflow.com/questions/48067514/utf-8-codec-cant-decode-byte-0xa0-in-position-4276-invalid-start-byte)

Working Solution

Use encoding="ISO-8859-1".
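
A minimal sketch of that applied to the call from the question (same path, column list and chunk size as in the question; only the encoding argument is added):

import pandas as pd

input_file_and_path = r'C:\Christian\ModellingData\X.txt'
column_names = ['X1']

raw_data = pd.DataFrame()
# Tell the parser the file is ISO-8859-1 instead of UTF-8
for chunk in pd.read_csv(input_file_and_path, names=column_names,
                         chunksize=1000, sep='\t', encoding="ISO-8859-1"):
    raw_data = pd.concat([raw_data, chunk], ignore_index=True)

print(raw_data.head())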

Upasana Mittal
  • Tried: for chunk in pd.read_csv(input_file_and_path, names=column_names, chunksize=1000, sep='\t', encoding="utf-8") – still getting the same error )-: – cs0815 Aug 09 '18 at 10:18
  • @cs0815 This shouldn't be the case though, as utf-8 automatically reads ASCII characters as well. Hope your pandas is up to date. Here they have used this encoding; see if it works: `open(file path, encoding='windows-1252')` https://stackoverflow.com/questions/48067514/utf-8-codec-cant-decode-byte-0xa0-in-position-4276-invalid-start-byte – Upasana Mittal Aug 09 '18 at 10:34
  • Thanks. It should be, as I just installed Anaconda from scratch yesterday. – cs0815 Aug 09 '18 at 10:37
  • encoding="ISO-8859-1" worked for me in the end ... you may want to adapt your answer ... – cs0815 Aug 09 '18 at 13:28
  • Sure. will do. Thanks :) – Upasana Mittal Aug 09 '18 at 13:29

Regarding your large-file problem, just use a file handle and a context manager:

with open("your_file.txt") as fileObject:
    for line in fileObject:
        do_something_with(line)

# No need to close the file: 'with' does that automatically

This won't load the whole file into memory. Instead, it'll load a line at a time, and will 'forget' previous lines unless you store a reference.
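
As a sketch of how the same chunk-at-a-time idea covers the follow-up question (finding, say, a median for imputation without holding the whole table in memory), pandas can do this too. The path, separator, encoding, chunk size and the choice of column 0 below are assumptions taken from the question:

import pandas as pd

# Collect a single column chunk by chunk, then take its median.
# All file-specific details here are placeholders from the question.
parts = []
for chunk in pd.read_csv(r'C:\Christian\ModellingData\X.txt', sep='\t',
                         header=None, usecols=[0], chunksize=100000,
                         encoding="ISO-8859-1"):
    parts.append(chunk[0].dropna())

median = pd.concat(parts).median()
print(median)
# A second pass over the chunks can then fill missing values with this median.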

Also, regarding your encoding problem, just use encoding="utf-8" while using pd.read_csv.

Adi219