Apologies if this is a repeat -- surely many people face this problem, yet I couldn't find a post that quite addresses it. I'd like to find the optimal solution.
I have a large dataset stored as a text file, where each line is one datapoint. I want to use the data for a supervised learning problem, and I don't want to keep the whole dataset in memory.
I can use iterators to read the data without loading the entire file into memory, but how can I perform a randomized test/train/validate split?
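For reference, this is roughly how I read the data now, one line at a time (the file name and function name are just placeholders):

```python
def iter_datapoints(path="data.txt"):  # "data.txt" is a placeholder name
    """Yield one datapoint (line) at a time; file objects are lazy iterators,
    so this never holds more than one line in memory."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")
```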
My best idea so far:
1. figure out how many lines the file has
2. randomly assign line indices to test/train/validate
3. write a generator that only reads those lines
For steps (1) and (3): what is the most elegant way to do this in Python 3?
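In code, my idea would look roughly like this (untested sketch; the file name, split fractions, and seed are just placeholders):

```python
import random

def count_lines(path):
    """Step (1): count lines without keeping them in memory."""
    with open(path) as f:
        return sum(1 for _ in f)

def split_indices(n_lines, frac_train=0.8, frac_test=0.1, seed=0):
    """Step (2): randomly assign each line index to train/test/validate."""
    rng = random.Random(seed)
    indices = list(range(n_lines))
    rng.shuffle(indices)
    n_train = int(frac_train * n_lines)
    n_test = int(frac_test * n_lines)
    return (set(indices[:n_train]),
            set(indices[n_train:n_train + n_test]),
            set(indices[n_train + n_test:]))

def read_split(path, wanted):
    """Step (3): generator yielding only the lines whose index is in `wanted`."""
    with open(path) as f:
        for i, line in enumerate(f):
            if i in wanted:
                yield line.rstrip("\n")

# usage
n = count_lines("data.txt")
train_idx, test_idx, val_idx = split_indices(n)
train_gen = read_split("data.txt", train_idx)
```

The index sets still take memory proportional to the number of lines, but that seems acceptable compared to holding the datapoints themselves. Is there a cleaner way, especially for the line counting and the selective reading?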