
I have a training dataset in CSV format that is 6 GB in size, which I am required to analyze and run machine learning on. My system has 6 GB of RAM, so it is not possible for me to load the whole file into memory. I need to perform random sampling and load the samples from the dataset; the number of samples may vary according to the requirement. How do I do this?

  • You can use the Python CSV reader to load the file in chunks and sample from each chunk. – DYZ Sep 22 '17 at 02:58
  • Possible duplicate of [Reading a huge .csv file](https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file). There are other similar Q&As, some with different answers. – wwii Sep 22 '17 at 04:25
  • Yes, I tried that, but I don't know the actual size of my dataset, so I could not create chunks properly and ended up overloading my system. – Shoumik Goswami Sep 22 '17 at 09:48
  • There are a lot of solutions here on SO, some of them use itertools.islice to consume lines that aren't being sampled - there is a `consume` function in the [Itertools Recipes](https://docs.python.org/3/library/itertools.html#itertools-recipes). You should be able to make that approach work. – wwii Sep 22 '17 at 14:02
  • I also like this answer https://stackoverflow.com/a/6347142/2823755 - A single pass over the file to create a list of line indices/positions. Then you seek to the line you want to sample. – wwii Sep 22 '17 at 14:04
  • Please read [mre] and explain exactly what "perform random sampling" entails. For example, do you need to sample the cells of a line, and repeat this for each line? Do you need to choose a small random subset **of the lines** in the file and load them? Something else? – Karl Knechtel Aug 01 '22 at 23:31
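
Following the chunked-reading suggestion in the comments above, a minimal sketch of that approach with pandas might look like the following (the chunk size, sample fraction, and file name are placeholders to adapt to your data):

```python
import pandas as pd

sample_frames = []
# Read the 6 GB file in manageable pieces instead of all at once.
for chunk in pd.read_csv('dataset.csv', chunksize=100000):
    # Keep roughly 1% of each chunk; adjust frac to the sample size you need.
    sample_frames.append(chunk.sample(frac=0.01))

sample = pd.concat(sample_frames, ignore_index=True)
```

Each chunk fits comfortably in RAM, so the full 6 GB file is never loaded at once.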
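
The index-then-seek answer linked above can be sketched roughly as below: one pass records the byte offset of every line, after which any random subset of lines can be read directly (the sample size of 1000 and the file name are placeholders):

```python
import random

# Pass 1: record the byte offset where each line starts.
offsets = []
with open('dataset.csv', 'rb') as f:
    while True:
        offsets.append(f.tell())
        if not f.readline():
            offsets.pop()        # the last tell() was end-of-file
            break

# Pass 2: seek straight to a random subset of lines.
with open('dataset.csv', 'rb') as f:
    for pos in random.sample(offsets[1:], 1000):   # offsets[1:] skips the header line
        f.seek(pos)
        row = f.readline().decode().rstrip('\n').split(',')
        # ...process the sampled row...
```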

1 Answer


Something to start with:

with open('dataset.csv') as f:
    for line in f:
        # sample_foo is a placeholder for your own sampling/processing of each row
        sample_foo(line.split(","))

This loads only one line into memory at a time, not the whole file.
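
If, as asked in the comments below, a fixed percentage of the rows is wanted without knowing the total row count, one way to extend this loop (a sketch only; the 1% fraction and file name are placeholders) is to keep each row with that probability:

```python
import csv
import random

sample = []
with open('dataset.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)              # keep the header row separately
    for row in reader:
        if random.random() < 0.01:     # keep roughly 1% of the rows
            sample.append(row)
```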

Raju Pitta
  • This is the right answer and the Pythonic way to do it. Since Python uses a generator instead of loading the whole file into memory, there is no memory pressure. – geckos Sep 22 '17 at 03:41
  • You may also want to mention using something like reservoir sampling (see https://en.wikipedia.org/wiki/Reservoir_sampling). While using iterators is a good way to save on memory, you still need a way to sample the entries. Also, if there is a header, the first line should be saved and the iteration should begin with the second line. – beigel Sep 22 '17 at 03:59
  • So I do not know the number of records in the dataset, and I want a sample size of, say, a fixed percentage of the dataset as random samples. Is it possible to make that happen? – Shoumik Goswami Sep 22 '17 at 09:47
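
For a fixed number of samples when the total row count is unknown, the reservoir sampling mentioned in the comments can be combined with the same line-by-line loop. A sketch (the sample size k and the file name are placeholders):

```python
import csv
import random

k = 10000                                # desired number of sampled rows
reservoir = []
with open('dataset.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)                # set aside the header row
    for i, row in enumerate(reader):
        if i < k:
            reservoir.append(row)        # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # replace entries with decreasing probability
            if j < k:
                reservoir[j] = row
```

Every row ends up in the sample with equal probability, even though the total row count is never known in advance.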