I have large svmlight files that I use for machine learning, and I'm trying to find out whether subsampling those files would give good enough results.
I want to extract random lines from my files to feed into my models, while loading as little data as possible into RAM.
I saw here (Read a number of random lines from a file in Python) that I could use linecache, but all the solutions there end up loading the whole file into memory.
Could someone give me some hints? Thank you.
EDIT: I forgot to say that I know the number of lines in each of my files beforehand.
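
To make this concrete, here is a rough sketch of the kind of thing I have in mind, not a tested solution. Since the line count is known, it draws the line indices up front and then streams through the file once, keeping only the chosen lines. The function name `sample_lines` and its parameters (`path`, `n_lines`, `k`, `seed`) are names I made up for illustration.

```python
import random

def sample_lines(path, n_lines, k, seed=None):
    """Yield k randomly chosen lines from a file of n_lines lines, in file order."""
    rng = random.Random(seed)
    # Only k integers are held in memory, not the file contents.
    wanted = set(rng.sample(range(n_lines), k))
    with open(path, "r") as f:
        # Iterating over the file object reads one line at a time.
        for i, line in enumerate(f):
            if i in wanted:
                yield line
```

This still reads the whole file from disk once, but at any moment it should only hold the current line plus the set of selected indices in RAM. I don't know whether there is a way to avoid the full pass.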