Not knowing exactly what you want/need to accomplish with the data does make this tricky, but most data manipulation can be done with SQL, so I would suggest using sqlite3 as the data-processing engine.
sqlite3 stores the data on disk, which sidesteps the impossibility of reading 20 GB of data into 16 GB of RAM.
Also, read the documentation for pandas.DataFrame.to_sql; it is what writes each chunk into the database.
You will need something like (not tested):
import sqlite3
import pandas as pd

conn = sqlite3.connect('out_Data.db')

# 'fields' is assumed to be the list of columns you actually need
data = pd.read_csv('dataset.csv', chunksize=1000, usecols=fields)

for data_chunk in data:
    # first argument is the target table name; each chunk is appended to it
    data_chunk.to_sql('data', conn, if_exists='append', index=False)

c = conn.cursor()
c.execute("SELECT * FROM data GROUP BY variable1")
# <<< perform data manipulation using SQL >>>
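As a sketch of that last step (not tested; the summary table and variable2 column are just placeholder names), an aggregation can be materialised entirely inside SQLite:

# hypothetical aggregation: one row per variable1, far smaller than the raw data
c.execute("""
    CREATE TABLE IF NOT EXISTS summary AS
    SELECT variable1,
           COUNT(*)       AS n_rows,
           AVG(variable2) AS mean_variable2
    FROM data
    GROUP BY variable1
""")
conn.commit()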
Bear in mind that you can't bring the data back into a pandas DataFrame unless the operations you perform in SQL first reduce the memory footprint dramatically.
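For example, loading only the aggregated result (assuming the hypothetical summary table above) keeps memory use small:

import pandas as pd

# only the reduced, aggregated result comes into memory, not the raw rows
summary_df = pd.read_sql_query("SELECT * FROM summary", conn)
print(summary_df.head())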
To convert the result back to .csv, follow Write to CSV from sqlite3 database in python.
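The short version, using the standard-library csv module (again assuming the hypothetical summary table), is something like:

import csv

c = conn.cursor()
rows = c.execute("SELECT * FROM summary")
with open('summary.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # column names come from the cursor metadata
    writer.writerow([col[0] for col in c.description])
    writer.writerows(rows)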
For better performance:
- Increase chunksize to the largest value your available memory can handle
- The sqlite3 command-line shell has a .import command (used together with .mode csv) for loading .csv files directly, which is a lot quicker than going via Python; see the example after this list.
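A typical shell session for that direct import looks like this (the table name data matches the Python example above; with .mode csv, .import creates the table from the file's header row if it doesn't already exist):

$ sqlite3 out_Data.db
sqlite> .mode csv
sqlite> .import dataset.csv data
sqlite> .quit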