Not knowing exactly what you want/need to accomplish with the data does make this tricky, but most data manipulation can be done with SQL, so I would suggest using sqlite3 as the data-processing engine.
sqlite3 stores the data on disk, which sidesteps the impossibility of reading 20 GB of data into 16 GB of RAM.
Also, read the documentation for pandas.DataFrame.to_sql; it is what writes each chunk into the database.
You will need something like (not tested):
import sqlite3
import pandas as pd

conn = sqlite3.connect('out_Data.db')

# 'fields' is assumed to be the list of columns you actually need
data = pd.read_csv('dataset.csv', chunksize=1000, usecols=fields)

for data_chunk in data:
    # first argument is the target table name; each chunk is appended to it
    data_chunk.to_sql('data', conn, if_exists='append', index=False)

c = conn.cursor()
c.execute("SELECT * FROM data GROUP BY variable1")
# <<< perform data manipulation using SQL >>>
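As a sketch of that last step (not tested; the summary table and variable2 column are just placeholder names), an aggregation can be materialised entirely inside SQLite:

# hypothetical aggregation: one row per variable1, far smaller than the raw data
c.execute("""
    CREATE TABLE IF NOT EXISTS summary AS
    SELECT variable1,
           COUNT(*)       AS n_rows,
           AVG(variable2) AS mean_variable2
    FROM data
    GROUP BY variable1
""")
conn.commit()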
Bear in mind that you can't bring the data back into a pandas DataFrame unless the operations you perform in SQL first reduce the memory footprint dramatically.
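For example, loading only the aggregated result (assuming the hypothetical summary table above) keeps memory use small:

import pandas as pd

# only the reduced, aggregated result comes into memory, not the raw rows
summary_df = pd.read_sql_query("SELECT * FROM summary", conn)
print(summary_df.head())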
To convert the result back to .csv, follow Write to CSV from sqlite3 database in python.
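The short version, using the standard-library csv module (again assuming the hypothetical summary table), is something like:

import csv

c = conn.cursor()
rows = c.execute("SELECT * FROM summary")
with open('summary.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # column names come from the cursor metadata
    writer.writerow([col[0] for col in c.description])
    writer.writerows(rows)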
For better performance:
- Increase chunksize to the largest value your available memory can handle
- The sqlite3 command-line shell has a .import command (used together with .mode csv) for loading .csv files directly, which is a lot quicker than going via Python; see the example after this list.
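A typical shell session for that direct import looks like this (the table name data matches the Python example above; with .mode csv, .import creates the table from the file's header row if it doesn't already exist):

$ sqlite3 out_Data.db
sqlite> .mode csv
sqlite> .import dataset.csv data
sqlite> .quit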