
Apologies if this is a repeat question -- certainly many people must face this problem, yet I didn't find a post that quite discusses it. I'd like to find the optimal solution.

I have a large dataset stored as a text file, where each line is one datapoint. I want to use the data for a supervised learning problem, and I don't want to keep the whole dataset in memory.

I can use iterators to read the data without loading the entire file into memory, but how can I perform a randomized test/train/validate split?

My best idea so far:

  1. figure out how many lines the document has

  2. randomly assign line indices to test/train/validate

  3. write up a generator that only reads those lines

For (1) and (3), I wonder: what is the most elegant way to do this in Python 3?
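
Roughly, I imagine something like the sketch below, where the file name `data.txt`, the 80/10/10 split, and the fixed seed are just placeholders:

```python
import random

def count_lines(path):
    """Count lines by streaming the file once, without loading it into memory."""
    with open(path) as f:
        return sum(1 for _ in f)

def split_indices(n_lines, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the line indices and cut them into train/validate/test sets."""
    indices = list(range(n_lines))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n_lines)
    n_val = int(fractions[1] * n_lines)
    return (set(indices[:n_train]),
            set(indices[n_train:n_train + n_val]),
            set(indices[n_train + n_val:]))

def read_split(path, wanted):
    """Generator that yields only the lines whose index is in `wanted`."""
    with open(path) as f:
        for i, line in enumerate(f):
            if i in wanted:
                yield line

# Usage (hypothetical file name):
# train_idx, val_idx, test_idx = split_indices(count_lines("data.txt"))
# for line in read_split("data.txt", train_idx):
#     ...
```

This only keeps the line indices in memory, not the lines themselves, but it does require a full pass over the file just to count the lines.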

  • What stops you from re-writing that file "on the fly" with iterators into 3 different sets? If nothing, I'll write a full answer with this idea, just wondering if there are any other constraints. – Filip Malczak May 30 '17 at 17:51
  • In general, we use a trivial program to split the data into three files. Then we simply use "normal" input routines to access the data. Is this not applicable for some reason? – Prune May 30 '17 at 20:46
  • Just to alert others: the "not all in memory" requirement means that scikit-learn's `train_test_split` method won't solve the problem. – Prune May 30 '17 at 21:55
  • Besides Prune's answer, I guess simply buying more memory would also solve the problem. (How big is your file? I guess not more than 8 GB.) – Martin Thoma May 30 '17 at 23:37
  • For counting the lines : https://stackoverflow.com/a/1019572/562769 – Martin Thoma May 30 '17 at 23:39
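
For reference, a rough sketch of the "rewrite on the fly" idea suggested in the comments: stream the file once and write each line into one of three output files chosen at random. The output file names and the 80/10/10 probabilities are just placeholders, and the resulting split sizes are only approximately proportional to those probabilities:

```python
import random

def split_into_files(path, probs=(0.8, 0.1, 0.1), seed=0):
    """Stream `path` once, writing each line to train/validate/test at random."""
    rng = random.Random(seed)
    outs = [open(name, "w") for name in ("train.txt", "validate.txt", "test.txt")]
    try:
        with open(path) as f:
            for line in f:
                r = rng.random()
                # Pick the output file whose cumulative probability covers r.
                if r < probs[0]:
                    outs[0].write(line)
                elif r < probs[0] + probs[1]:
                    outs[1].write(line)
                else:
                    outs[2].write(line)
    finally:
        for out in outs:
            out.close()
```

This avoids counting lines up front, at the cost of duplicating the data on disk.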

0 Answers