Could you help me? I'm having a problem reading random rows from a large CSV file using pandas 0.18.1 and Python 2.7.10 on Windows (8 GB RAM).
In Read a small random sample from a big CSV file into a Python data frame I saw an approach, but it turned out to be very memory-consuming on my PC. This is the relevant part of the code:
import random as rnd
import pandas as pd

n = 100  # total number of rows in the CSV file
s = 10   # desired sample size
skip = sorted(rnd.sample(xrange(1, n), n - s))  # skip n - s random rows
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'},
                   skiprows=skip)
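To show why this breaks down at my scale (the row count below is just an illustrative guess at a "large" file, not my actual data), the skip list alone becomes enormous before read_csv even starts:

import random as rnd

n = 10 * 10**6   # assumed total rows in a large CSV (illustrative, not my real file)
s = 100 * 10**3  # the 100,000 rows I actually want to keep
# the skip list holds n - s = 9,900,000 sorted integers in memory
skip = sorted(rnd.sample(xrange(1, n), n - s))
print len(skip)  # 9900000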
So if I want to take not just 100 random rows but 100,000, it becomes hard. Taking non-random (consecutive) rows, on the other hand, works almost fine:
skip = xrange(100000)  # skip the first 100,000 rows, then read the next 10,000
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'},
                   skiprows=skip, nrows=10000)
So the question is: how can I read a large number of random rows from a large CSV file with pandas? I can't load the entire CSV into memory, and chunked reading doesn't directly help, since I need random rows spread across the whole file (see the sketch below for what I mean by chunking). Thanks
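For concreteness, by chunked reading I mean something along these lines (a sketch only; the chunksize and the 1% fraction are assumed values, not from my real setup). It keeps memory bounded per chunk, but it still forces a full pass over the file:

import pandas as pd
import random as rnd

sampled = []
# read the file in pieces of one million rows (assumed chunk size)
for chunk in pd.read_csv(path, usecols=['Col1', 'Col2'],
                         dtype={'Col1': 'int32', 'Col2': 'int32'},
                         chunksize=10**6):
    # keep a random ~1% of each chunk (the fraction is only an example)
    keep = sorted(rnd.sample(xrange(len(chunk)), len(chunk) // 100))
    sampled.append(chunk.iloc[keep])
data = pd.concat(sampled, ignore_index=True)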