I have an 8GB CSV file that contains information about companies created in France. When I try to load it in Python using pandas.read_csv, I get various errors. I believe a combination of three factors causes the problem:

  • The size of the file (8GB)
  • The French characters in the cells (like “é”)
  • The fact that this CSV file is organized like an Excel file; the fields are separated by column, just like an XLS file

When I tried to import the file using:

import pandas as pd
df = pd.read_csv(r'C:\..\data.csv')

I got the following error: OSError: Initializing from file failed

Then, to rule out the file size as the cause, I copied the file (data.csv) and kept only the first 25 rows (data2.csv), which gives a much lighter file:

df = pd.read_csv(r'C:\..\data2.csv')

I get the same OSError: Initializing from file failed error.

After some research, I tried the following code with data2.csv:

df = pd.read_csv(r'C:\..\data2.csv', sep="\t", encoding="latin")

This time the import succeeds, but the result is malformed: all fields end up in a single column (see https://i.stack.imgur.com/XUBQn.jpg).

So even with the size problem eliminated, pandas does not read the CSV file properly. I still need to work with the main file, data.csv, so I tried the same code on it:

df = pd.read_csv(r'C:\..\data.csv', sep="\t", encoding="latin")

I get: ParserError: Error tokenizing data. C error: out of memory

What is the correct way to read data.csv?

Thank you,

Siva Kg

1 Answer

From your image, it looks like the fields are separated by semicolons (;). Try passing sep=";" to read_csv.
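
As a rough illustration of the separator fix, here is a minimal sketch. The column names and rows are made up, and an in-memory buffer stands in for data2.csv; with the real file you would pass its path instead. It assumes the file is Latin-1 encoded, as the question's encoding="latin" suggests.

```python
import io
import pandas as pd

# Hypothetical sample standing in for data2.csv: semicolon-separated,
# Latin-1 encoded, with accented characters. Column names are invented.
sample = (
    "siren;nom;ville\n"
    "123456789;Société Générale;Paris\n"
    "987654321;Café Léa;Orléans\n"
).encode("latin-1")

# sep=";" splits the fields into separate columns;
# encoding="latin-1" decodes characters such as "é" correctly.
df = pd.read_csv(io.BytesIO(sample), sep=";", encoding="latin-1")

print(df.columns.tolist())
print(df)
```

If the columns still come out merged, it may be worth opening the first few lines of the file in a text editor to confirm which delimiter it really uses.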

pandas reads the whole CSV into RAM, and an 8GB file could easily exhaust it. Try reading the file in chunks with the chunksize parameter of read_csv. See this answer.
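
A minimal sketch of chunked reading, assuming you only need a subset of the rows in memory at once. The helper name load_filtered, the column names, and the filter are hypothetical; the in-memory buffer stands in for data.csv, and the chunk size is something you would tune to your available RAM.

```python
import io
import pandas as pd

def load_filtered(source, keep, chunksize=100_000):
    """Read a semicolon-separated CSV in chunks, keeping only rows where keep(chunk) is True."""
    parts = []
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(source, sep=";", encoding="latin-1", chunksize=chunksize):
        parts.append(chunk[keep(chunk)])
    return pd.concat(parts, ignore_index=True)

# Hypothetical sample standing in for data.csv (column names are invented).
rows = "".join(
    f"{i};{'Paris' if i % 2 == 0 else 'Orléans'}\n" for i in range(10)
)
sample = ("siren;ville\n" + rows).encode("latin-1")

# A tiny chunksize is used here only to exercise the loop on a small sample.
result = load_filtered(io.BytesIO(sample), lambda c: c["ville"] == "Paris", chunksize=4)
print(result)
```

This only helps if the filtered or aggregated result fits in memory; if you need every row, consider also restricting the columns with usecols or specifying smaller dtypes.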

N.Clarke