
I have a training dataset in CSV format that is 6 GB in size, which I am required to analyze and run machine learning on. My system has 6 GB of RAM, so it is not possible for me to load the whole file into memory. I need to perform random sampling and load the samples from the dataset; the number of samples may vary according to the requirement. How do I do this?

  • You can use the Python CSV reader to load the file in chunks and sample from each chunk. – DYZ Sep 22 '17 at 02:58
  • Possible duplicate of [Reading a huge .csv file](https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file). There are other similar Q&As, some with different answers. – wwii Sep 22 '17 at 04:25
  • Yes, I tried that, but I don't know the actual size of my dataset, so I could not create chunks properly and ended up overloading my system. – Shoumik Goswami Sep 22 '17 at 09:48
  • There are a lot of solutions here on SO, some of them use itertools.islice to consume lines that aren't being sampled - there is a `consume` function in the [Itertools Recipes](https://docs.python.org/3/library/itertools.html#itertools-recipes). You should be able to make that approach work. – wwii Sep 22 '17 at 14:02
  • I also like this answer https://stackoverflow.com/a/6347142/2823755 - A single pass over the file to create a list of line indices/positions. Then you seek to the line you want to sample. – wwii Sep 22 '17 at 14:04
  • Please read [mre] and explain exactly what "perform random sampling" entails. For example, do you need to sample the cells of a line, and repeat this for each line? Do you need to choose a small random subset **of the lines** in the file and load them? Something else? – Karl Knechtel Aug 01 '22 at 23:31
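
Following the chunked-reading suggestion in the comments above, a minimal sketch of that approach with pandas might look like the following (the chunk size, sample fraction, and file name are placeholders to adapt to your data):

```python
import pandas as pd

sample_frames = []
# Read the 6 GB file in manageable pieces instead of all at once.
for chunk in pd.read_csv('dataset.csv', chunksize=100000):
    # Keep roughly 1% of each chunk; adjust frac to the sample size you need.
    sample_frames.append(chunk.sample(frac=0.01))

sample = pd.concat(sample_frames, ignore_index=True)
```

Each chunk fits comfortably in RAM, so the full 6 GB file is never loaded at once.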
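
The index-then-seek answer linked above can be sketched roughly as below: one pass records the byte offset of every line, after which any random subset of lines can be read directly (the sample size of 1000 and the file name are placeholders):

```python
import random

# Pass 1: record the byte offset where each line starts.
offsets = []
with open('dataset.csv', 'rb') as f:
    while True:
        offsets.append(f.tell())
        if not f.readline():
            offsets.pop()        # the last tell() was end-of-file
            break

# Pass 2: seek straight to a random subset of lines.
with open('dataset.csv', 'rb') as f:
    for pos in random.sample(offsets[1:], 1000):   # offsets[1:] skips the header line
        f.seek(pos)
        row = f.readline().decode().rstrip('\n').split(',')
        # ...process the sampled row...
```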

1 Answer


Something to start with:

with open('dataset.csv') as f:
    for line in f:
        # sample_foo is a placeholder for your own sampling/processing of each row
        sample_foo(line.split(","))

This loads only one line into memory at a time, not the whole file.
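
If, as asked in the comments below, a fixed percentage of the rows is wanted without knowing the total row count, one way to extend this loop (a sketch only; the 1% fraction and file name are placeholders) is to keep each row with that probability:

```python
import csv
import random

sample = []
with open('dataset.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)              # keep the header row separately
    for row in reader:
        if random.random() < 0.01:     # keep roughly 1% of the rows
            sample.append(row)
```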

Raju Pitta
  • This is the right answer and the Pythonic way to do it. Since Python uses a generator instead of loading the whole file into memory, there is no memory pressure. – geckos Sep 22 '17 at 03:41
  • You may also want to mention using something like reservoir sampling (see https://en.wikipedia.org/wiki/Reservoir_sampling). While using iterators is a good way to save on memory, you still need a way to sample the entries. Also, if there is a header, the first line should be saved and the iteration should begin with the second line. – beigel Sep 22 '17 at 03:59
  • So I do not know the number of records in the dataset, and I want a sample size of, say, a fixed percentage of the dataset as random samples. Is it possible to make that happen? – Shoumik Goswami Sep 22 '17 at 09:47
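
For a fixed number of samples when the total row count is unknown, the reservoir sampling mentioned in the comments can be combined with the same line-by-line loop. A sketch (the sample size k and the file name are placeholders):

```python
import csv
import random

k = 10000                                # desired number of sampled rows
reservoir = []
with open('dataset.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)                # set aside the header row
    for i, row in enumerate(reader):
        if i < k:
            reservoir.append(row)        # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # replace entries with decreasing probability
            if j < k:
                reservoir[j] = row
```

Every row ends up in the sample with equal probability, even though the total row count is never known in advance.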