
Apologies if this is a repeat question -- certainly many people must face this problem, yet I didn't find a post that quite discusses it. I'd like to find the optimal solution.

I have a large dataset stored as a text file, where each line is one datapoint. I want to use the data for a supervised learning problem, and I don't want to keep the whole dataset in memory.

I can use iterators to read the data without loading the entire file into memory, but how can I perform a randomized test/train/validate split?

My best idea so far:

  1. figure out how many lines the document has

  2. randomly assign line indices to test/train/validate

  3. write up a generator that only reads those lines

For (1) and (3), I wonder: what is the most elegant way to do this in Python 3?
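
Roughly, I imagine something like the sketch below, where the file name `data.txt`, the 80/10/10 split, and the fixed seed are just placeholders:

```python
import random

def count_lines(path):
    """Count lines by streaming the file once, without loading it into memory."""
    with open(path) as f:
        return sum(1 for _ in f)

def split_indices(n_lines, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the line indices and cut them into train/validate/test sets."""
    indices = list(range(n_lines))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n_lines)
    n_val = int(fractions[1] * n_lines)
    return (set(indices[:n_train]),
            set(indices[n_train:n_train + n_val]),
            set(indices[n_train + n_val:]))

def read_split(path, wanted):
    """Generator that yields only the lines whose index is in `wanted`."""
    with open(path) as f:
        for i, line in enumerate(f):
            if i in wanted:
                yield line

# Usage (hypothetical file name):
# train_idx, val_idx, test_idx = split_indices(count_lines("data.txt"))
# for line in read_split("data.txt", train_idx):
#     ...
```

This only keeps the line indices in memory, not the lines themselves, but it does require a full pass over the file just to count the lines.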

  • What stops you from re-writing that file "on the fly" with iterators into 3 different sets? If nothing, I'll write a full answer with this idea, just wondering if there are any other constraints. – Filip Malczak May 30 '17 at 17:51
  • In general, we use a trivial program to split the data into three files. Then we simply use "normal" input routines to access the data. Is this not applicable for some reason? – Prune May 30 '17 at 20:46
  • Just to alert others: the "not all in memory" requirement means that scikit-learn's `train_test_split` method won't solve the problem. – Prune May 30 '17 at 21:55
  • Besides Prune's answer, I guess simply buying more memory would also solve the problem. (How big is your file? I guess not more than 8 GB.) – Martin Thoma May 30 '17 at 23:37
  • For counting the lines : https://stackoverflow.com/a/1019572/562769 – Martin Thoma May 30 '17 at 23:39
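
For reference, a rough sketch of the "rewrite on the fly" idea suggested in the comments: stream the file once and write each line into one of three output files chosen at random. The output file names and the 80/10/10 probabilities are just placeholders, and the resulting split sizes are only approximately proportional to those probabilities:

```python
import random

def split_into_files(path, probs=(0.8, 0.1, 0.1), seed=0):
    """Stream `path` once, writing each line to train/validate/test at random."""
    rng = random.Random(seed)
    outs = [open(name, "w") for name in ("train.txt", "validate.txt", "test.txt")]
    try:
        with open(path) as f:
            for line in f:
                r = rng.random()
                # Pick the output file whose cumulative probability covers r.
                if r < probs[0]:
                    outs[0].write(line)
                elif r < probs[0] + probs[1]:
                    outs[1].write(line)
                else:
                    outs[2].write(line)
    finally:
        for out in outs:
            out.close()
```

This avoids counting lines up front, at the cost of duplicating the data on disk.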

0 Answers