I'm trying to import a large tab-delimited file (about 3 GB) into Python with pandas, using pd.read_csv("file.txt", sep="\t"). The file is a ".tab" file whose extension I changed to ".txt" so I could import it with read_csv(). It has 305 columns and roughly 1,000,000 rows.
When I execute the code, Python returns a MemoryError after some time. From what I've read, this basically means there is not enough RAM available. When I specify nrows=20 in read_csv(), it works fine.
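For what it's worth, a small sample read like that can also be used to estimate the memory footprint of the full file (just a rough sketch of what I tried; same file and separator as above):

import pandas as pd

# read a small sample and measure its in-memory size, including the
# string contents of object columns (deep=True)
sample = pd.read_csv("file.txt", sep="\t", nrows=20)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
print(f"estimated size for 1,000,000 rows: {bytes_per_row * 1_000_000 / 1e9:.1f} GB")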
The computer I'm using has 46 GB of RAM, of which roughly 20 GB was available to Python.
My question: how can a 3 GB file need more than 20 GB of RAM to be imported into Python using pandas read_csv()? Am I doing anything wrong?
EDIT: When executing df.dtypes, the types are a mix of object, float64, and int64.
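Since the object columns store full Python strings, I suspect they account for most of the memory. This is a sketch of how I'd check which columns those are, on a 1000-row sample:

sample = pd.read_csv("file.txt", sep="\t", nrows=1000)

# per-column memory of the sample, largest first (deep=True counts the
# actual string contents of object columns)
per_col = sample.memory_usage(deep=True).sort_values(ascending=False)
print(per_col.head(10))

# which columns came in as object
print(sample.dtypes[sample.dtypes == "object"])

If the object columns turn out to have few distinct values, passing dtype={...: "category"} to read_csv might shrink them, but I haven't tried that yet.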
UPDATE: I used the following code to overcome the problem and perform my calculations:
summed_cols = pd.DataFrame(columns=["sample", "read sum"])
x = 0  # x is incremented before the read, so columns 1 through 352 are loaded one at a time
while x < 352:
    x = x + 1
    # load only the current column to keep memory usage low
    sample_col = pd.read_csv("file.txt", sep="\t", usecols=[x])
    # sum the column and append the result as a new row
    summed_cols = summed_cols.append(
        pd.DataFrame({"sample": [sample_col.columns[0]],
                      "read sum": [sample_col[sample_col.columns[0]].sum()]}))
    del sample_col
The loop now selects one column at a time, performs the calculation, stores the result in a dataframe, deletes that column, and moves on to the next one.
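A single-pass variant of the same idea (just a sketch I haven't run on the full file; it assumes all the summed columns are numeric) would be to stream the file in chunks and accumulate the sums:

import pandas as pd

# read the file in chunks and accumulate per-column sums in one pass
totals = None
for chunk in pd.read_csv("file.txt", sep="\t", chunksize=100_000):
    chunk_sums = chunk.sum(numeric_only=True)
    totals = chunk_sums if totals is None else totals.add(chunk_sums, fill_value=0)

# same shape as summed_cols above: one row per column with its total
summed_cols = totals.rename("read sum").rename_axis("sample").reset_index()

This reads the file only once instead of once per column, at the cost of holding one chunk (100,000 rows here) in memory at a time.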