There isn't a quick way to do it with a LazyFrame that originates from scan_csv, because at some point it has to scan the whole file to reach a random row towards the end.
This is a shortcoming of the csv format: the reader can only get to an arbitrary line by scanning through the file line by line, looking for the \n character that marks the end of each line.
If you don't care about knowing which line number you got, you could just seek to a random place in the file, find the end of that line, and then take the next full line, but polars isn't optimized to do that. This approach is also biased: lines that follow longer lines have a greater chance of being selected, so depending on the variance in line length and how much the randomness matters, it might be unusable.
Notwithstanding the disclaimer, you could do:
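import os
import random
import polars as pl

# A_very_large_text_file holds the path to the csv file
with open(A_very_large_text_file, "rb") as ff:  # binary mode so seek() accepts any byte offset
    ff.seek(random.randrange(os.path.getsize(A_very_large_text_file)))
    ff.readline()  # ignore partial line
    line = ff.readline().decode().rstrip("\r\n")  # empty if the seek landed in the last line
# wrap each value in a list so every column holds exactly one row
randomish_row = pl.DataFrame({f"col{i}": [x] for i, x in enumerate(line.split(","))})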
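To get a feel for how strong that bias can be, here is a small stdlib-only sketch. The file contents and the helper name seek_sample are made up for illustration: one 97-character line followed by three one-character lines. Note that with this approach the first line of the file can never be selected, and a seek that lands in the last line returns nothing.

```python
import os
import random
import tempfile
from collections import Counter


def seek_sample(path, size, rng):
    """Seek to a random byte, skip the rest of that line, return the next line."""
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))
        f.readline()  # discard the (possibly partial) line we landed in
        return f.readline().rstrip(b"\n")  # b"" if we landed in the last line


# Hypothetical test file: one long line followed by three short ones.
lines = [b"x" * 97, b"a", b"b", b"c"]
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"\n".join(lines) + b"\n")
    path = tmp.name

rng = random.Random(0)  # fixed seed so repeated runs give the same counts
size = os.path.getsize(path)
counts = Counter(seek_sample(path, size, rng) for _ in range(10_000))
os.remove(path)

# Show how often each line was picked; the line after the long one dominates.
for line, n in counts.most_common():
    print(line[:10], n)
```

The long line occupies 98 of the file's 104 bytes, so the line right after it should be picked roughly 94% of the time, even though it is only one of four lines.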
Alternatively, use pyarrow to convert your csv file into a parquet file with multiple row groups. Then you can create your LazyFrame with scan_parquet. Since parquet files are highly structured, the reader can jump to a random part of the file much more efficiently. See here