So I have a 15 GB tab-delimited file with 500 million rows that I need to read into Python and run some analysis on. What is the most efficient way to go about this?
I have access to a Linux server with 4 cores and 16 GB RAM. At the moment I am using dask.dataframe.read_csv() with 4 workers for my analysis, but I regularly run into memory issues and the Jupyter kernel just dies during computations like groupby and iterating over the rows.
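Roughly, my setup looks like this (a minimal sketch of what I'm doing; the file path, column names, block size, and memory limit are just placeholders, not my real values):

```python
import dask.dataframe as dd
from dask.distributed import Client

# Local cluster: one worker per core, capped memory per worker
client = Client(n_workers=4, threads_per_worker=1, memory_limit="3GB")

# Tab-delimited file, read lazily in partitions
df = dd.read_csv("data.tsv", sep="\t", blocksize="64MB")

# Example of the kind of computation that kills the kernel
result = df.groupby("some_key")["some_value"].mean().compute()
```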
Are there any better ways to do this, or any way to avoid running out of memory in Dask?