
I'm kind of new to Python and data science.

I have a 33 GB CSV dataset, and I want to parse it into a DataFrame to do some work on it.

I tried to do it the 'casual' way with pandas.read_csv, and it's taking ages to parse.

I searched on the internet and found this article.

It says that the most efficient way to read a large CSV file is to use csv.DictReader.

So I tried to do that:

import pandas as pd
import csv

df = pd.DataFrame(csv.DictReader(open("MyFilePath")))

Even with this solution, it's taking ages to do the job.

Can you guys please tell me the most efficient way to parse a large dataset into pandas?

Fragan
  • 33 GB is a truly massive text file, so it will inevitably take ages. Does your machine actually have enough memory to handle the resulting DataFrame? – Simon Notley Nov 22 '19 at 09:01
  • Have you seen https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas ? – Thierry Lathuille Nov 22 '19 at 09:02
  • As @ThierryLathuille states, you can read and process it in chunks, but this will still take ages; it only handles possible memory limitations. – luigigi Nov 22 '19 at 09:04
  • The data is huge, whatever you do will take time. Can you fit all the data in memory at once? If not, you have no other choice but manipulate it by chunks. – Thierry Lathuille Nov 22 '19 at 09:06
  • @ThierryLathuille I'll do as you suggested and chunk the data – Fragan Nov 22 '19 at 09:52

1 Answer


There is no way you can read such a big file in a short time. Still, there are some strategies for dealing with large data; here are a few of them that let you implement your code without leaving the comfort of pandas:

Sampling
Chunking
Optimising Pandas dtypes
Parallelising Pandas with Dask.

The simplest option is sampling your dataset (this may be helpful for you). Sometimes a random part of a large dataset already contains enough information for the calculations that follow. If you don't actually need to process your entire dataset, this is an excellent technique to use. Sample code:

import pandas
import random

filename = "data.csv"
m = 10  # keep roughly 1 row out of every m

# number of data lines in the file (excluding the header)
with open(filename) as f:
    n = sum(1 for line in f) - 1

s = n // m  # number of rows to keep
# randomly choose the data rows to skip; row 0 is the header, so it is never skipped
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pandas.read_csv(filename, skiprows=skip)

This is the link for chunking large data.
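
If sampling isn't enough, here is a minimal sketch of the chunking idea (the file path, column names and dtypes below are placeholders, so adjust them to your data). Passing chunksize to pandas.read_csv yields the file in manageable pieces, so you only ever hold one piece plus a small summary in memory:

import pandas as pd

filename = "data.csv"  # placeholder path, replace with your own

# Reading only the columns you need, with explicit dtypes, also cuts memory use.
# These column names are invented for the example.
dtypes = {"user_id": "int32", "amount": "float32"}

partial_sums = []
# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the whole 33 GB file at once
for chunk in pd.read_csv(filename, usecols=list(dtypes), dtype=dtypes, chunksize=1_000_000):
    # do your per-chunk work here and keep only the (small) result
    partial_sums.append(chunk.groupby("user_id")["amount"].sum())

# combine the per-chunk results into one final summary
summary = pd.concat(partial_sums).groupby(level=0).sum()

And if you'd rather not write the chunk loop yourself, Dask mirrors much of the pandas API and parallelises the work for you (again, the columns are just an example):

import dask.dataframe as dd

# Dask splits the CSV into partitions and only reads them when compute() is called
ddf = dd.read_csv("data.csv")
summary = ddf.groupby("user_id")["amount"].sum().compute()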

gjiki