
Could you help me? I am facing a problem reading random rows from a large CSV file using pandas 0.18.1 and Python 2.7.10 on Windows (8 GB RAM).

In Read a small random sample from a big CSV file into a Python data frame I saw an approach, but it turned out to be very memory-consuming on my PC, namely this part of the code:

import random as rnd
import pandas as pd

n = 100
s = 10
skip = sorted(rnd.sample(xrange(1, n), n - s))  # skip n-s random rows from the *.csv
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip)

So if I want to take random rows from the file at a scale of not just 100 rows but 100 000, it becomes hard, whereas taking non-random rows from the file works almost fine:

skip = xrange(100000)
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip, nrows=10000)

So the question is: how can I handle reading a large number of random rows from a large CSV file with pandas? Since I can't read the entire CSV file, even with chunking, I'm interested specifically in random rows. Thanks.

Marcel Mars
  • I think most of the discussion on this question will be relevant. In short, there wasn't a great solution that wasn't memory-intensive: http://stackoverflow.com/questions/38039723/splitting-a-large-pandas-dataframe-with-minimal-memory-footprint/38086123#38086123 – Jeff Jul 06 '16 at 21:05
  • skiprows uses a lot of memory; try using chunks instead: http://stackoverflow.com/questions/36874993/pandas-memory-error-after-a-certain-skiprows-parameter – ayhan Jul 06 '16 at 21:10

2 Answers


If memory is the biggest issue, a possible solution is to read the file in chunks and randomly sample rows from each chunk:

import random

import pandas as pd

n = 100
s = 10
factor = 1    # should be an integer
chunksize = int(s / factor)

reader = pd.read_csv(path, usecols=['Col1', 'Col2'],
                     dtype={'Col1': 'int32', 'Col2': 'int32'}, chunksize=chunksize)

out = []
tot = 0
for df in reader:
    # take a random number of rows from this chunk
    nsample = random.randint(factor, chunksize)
    tot += nsample
    if tot > s:
        # trim the last sample so that exactly s rows are collected in total
        nsample = s - (tot - nsample)
    out.append(df.sample(nsample))
    if tot >= s:
        break

data = pd.concat(out)

You can use factor to control the size of the chunks.
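
If it helps, here is one way the loop above could be packaged for reuse; the function name and the guard against a short final chunk are my own additions, not part of the answer, and path and the column names are the question's placeholders:

import random
import pandas as pd

def sample_csv_in_chunks(path, s, factor=1, **read_kwargs):
    # read the file chunk by chunk and take a random number of rows from each
    # chunk until s rows have been collected in total
    chunksize = int(s / factor)
    reader = pd.read_csv(path, chunksize=chunksize, **read_kwargs)
    out = []
    tot = 0
    for df in reader:
        # never ask for more rows than the chunk actually has (guards the last chunk)
        nsample = min(random.randint(factor, chunksize), len(df))
        tot += nsample
        if tot > s:
            nsample = s - (tot - nsample)   # trim so that exactly s rows are kept
        out.append(df.sample(nsample))
        if tot >= s:
            break
    return pd.concat(out)

# for example, at the scale mentioned in the question:
# data = sample_csv_in_chunks(path, s=100000, factor=10,
#                             usecols=['Col1', 'Col2'],
#                             dtype={'Col1': 'int32', 'Col2': 'int32'})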

Gerges
  • One issue is that this approach isn't truly random, as you force selections from each chunk. For example, it's possible, albeit not probable, that a random selection of n items is the first n items, and the chunked approach does not admit this possibility. However, you could also argue that this approach is good enough for most cases that require random selection. – root Jul 06 '16 at 22:33
  • That's right. I edited the answer to take that into account, so the number of samples from each chunk is random. However, there is still a caveat: it needs at least 'factor' samples per chunk to reach the required number of samples before hitting the end of the file. To be really random, the possibility of looping over the file more than once would need to be considered (a one-pass alternative is sketched after these comments). – Gerges Jul 06 '16 at 22:48
  • Thanks, yes, indeed, I believe that is also a way out: randomly choose chunks, randomly select rows from them, and do this iteratively. It looks less memory-demanding than the skiprows option. – Marcel Mars Jul 07 '16 at 10:06
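
On the "truly random" point in these comments: one standard way to draw a uniform sample in a single pass, while keeping only s rows in memory, is reservoir sampling. The sketch below is not from either answer (the function name is mine, path and the column names are the question's placeholders) and assumes the header is on the first line of the file:

import random
from io import StringIO   # Python 2: from StringIO import StringIO
import pandas as pd

def reservoir_sample_csv(path, s, **read_kwargs):
    # keep a reservoir of at most s raw CSV lines while streaming through the file
    with open(path) as f:
        header = next(f)
        reservoir = []
        for i, line in enumerate(f):
            if i < s:
                reservoir.append(line)
            else:
                j = random.randint(0, i)      # inclusive on both ends
                if j < s:
                    reservoir[j] = line       # replace a kept line with probability s/(i+1)
    buf = StringIO(header + ''.join(reservoir))
    return pd.read_csv(buf, **read_kwargs)

# for example:
# data = reservoir_sample_csv(path, s=100000, usecols=['Col1', 'Col2'],
#                             dtype={'Col1': 'int32', 'Col2': 'int32'})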

I think this is faster than the other methods shown here and may be worth trying.

Say we have already chosen the rows to be skipped and stored them in a list skipped. First, I convert it to a boolean lookup table.

import numpy as np

# Some preparation:
skipped = np.asarray(skipped)
# MAX >= number of rows in the file
bool_skipped = np.zeros(MAX, dtype=bool)
bool_skipped[skipped] = True
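
The snippet above assumes skipped and MAX are already available. One possible way to obtain them (my own sketch, not part of the answer; path is the question's placeholder, and on Python 2 use xrange instead of range) is to count the file's lines once and draw the line numbers to drop with random.sample, never dropping line 0, which holds the header:

import random

with open(path) as f:
    MAX = sum(1 for _ in f)               # total number of lines, header included

s = 100000                                 # number of data rows to keep, for example
n_data = MAX - 1                           # data rows, i.e. excluding the header line
skipped = random.sample(range(1, MAX), n_data - s)    # line numbers to drop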

Main stuff:

from io import StringIO
# in Python 2 use
# from StringIO import StringIO
import pandas as pd

def load_with_buffer(filename, bool_skipped, **kwargs):
    s_buf = StringIO()
    with open(filename) as file:
        count = -1
        for line in file:
            count += 1
            if bool_skipped[count]:
                continue          # drop this line
            s_buf.write(line)     # keep this line in the in-memory buffer
    s_buf.seek(0)
    df = pd.read_csv(s_buf, **kwargs)
    return df

I tested it as follows:

df = pd.DataFrame(np.random.rand(100000, 100))
df.to_csv('test.csv')

df1 = load_with_buffer('test.csv', bool_skipped, index_col=0)

with 90% of rows skipped. It performs comparably to

pd.read_csv('test.csv', skiprows=skipped, index_col=0)

and is about 3-4 times faster than using dask or reading in chunks.
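
For anyone who wants to reproduce the comparison, here is a minimal timing sketch (mine, not the answer's original benchmark; it assumes test.csv, bool_skipped and skipped from above are already defined, and the numbers will of course vary by machine):

import time

start = time.time()
df1 = load_with_buffer('test.csv', bool_skipped, index_col=0)
print('buffered read: %.2f s' % (time.time() - start))

start = time.time()
df2 = pd.read_csv('test.csv', skiprows=skipped, index_col=0)
print('skiprows read: %.2f s' % (time.time() - start))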

ptrj