
I have to read large .csv files of around 20 MB each. The files are tables composed of 8 columns and 5198 rows. I have to do some statistics over a specific column `I`.

I have n different files, and this is what I am doing:

import numpy as np
import pandas as pd

stat = np.arange(n)
I = 0
for k in stat:
    df = pd.read_csv(pathS + 'run_TestRandom_%d.csv' % k, sep=' ')
    I += df['I']
I = I / n  # average over the n files

This process takes 0.65 s, and I am wondering if there is a faster way.

  • maybe try to specify `memory_map=True` in `pd.read_csv` – Gosha F Nov 30 '16 at 17:11
  • If the data are exclusively numeric then there's no need to use the **csv** module; you could use **split**. There's some small overhead in using the dictionary to access record fields; you could instead use **find** on the header in the csv and then use that index to obtain items from the split record. – Bill Bell Nov 30 '16 at 17:19
  • the first row is not numeric though, is it possible to use `split`? – emax Nov 30 '16 at 17:21
  • `20MB` is not a large file. `20GB` is a large file. – furas Nov 30 '16 at 17:30
  • @furas: That was my thought. Beyond that, depending on disk fragmentation, taking 0.65 seconds to read a 20 MB file could be near the limit of the disk (last I checked, most spinning disks top out below 100 MB/s even for contiguous data, so on a cold read you'd expect at least 0.2 s just for reading, more if fragmented, ignoring all processing costs). __Edit__: Looks like desktop-class drives peak closer to 150 MB/s nowadays, with laptop drives in the 70-100 MB/s range. Even so, fragmentation can easily cut that by a factor of 10x. – ShadowRanger Nov 30 '16 at 18:07
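
A minimal sketch of the `memory_map=True` suggestion from the comments above, applied to the loop from the question. The `usecols` argument is an extra assumption (not part of the comment) that limits parsing to the one column that is actually used:

import numpy as np
import pandas as pd

# Sketch only: the question's loop, with memory_map=True and usecols added.
stat = np.arange(n)
I = 0
for k in stat:
    df = pd.read_csv(pathS + 'run_TestRandom_%d.csv' % k, sep=' ',
                     memory_map=True,   # map the file into memory instead of buffered reads
                     usecols=['I'])     # parse only the column that is needed
    I += df['I']
I = I / n  # element-wise average over the n files

And a rough sketch of the plain-`split` idea from the second comment, assuming the header row contains the column name `I`; `header.index('I')` stands in for the comment's `find` suggestion:

def read_column_I(path):
    # Illustrative only: extract column 'I' from one space-separated file without pandas.
    with open(path) as f:
        header = f.readline().split()
        idx = header.index('I')   # position of column 'I' in the header row
        return [float(line.split()[idx]) for line in f if line.strip()]

Whether either of these actually beats the original loop would have to be timed on the real files.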

1 Answer


EDIT: Apparently this is a really bad way to do it! Don't do what I did, I guess :/

I'm working on a similar problem right now with about the same-sized dataset. The method I'm using is numpy's `genfromtxt`:

import numpy as np

# Read the whole file into a structured array, skipping the header row
# and giving each of the 8 columns a name.
ary2d = np.genfromtxt('yourfile.csv', delimiter=',', skip_header=1, skip_footer=0,
                      names=['col1', 'col2', 'col3', 'col4',
                             'col5', 'col6', 'col7', 'col8'])

On my system it takes about 0.1 s in total.

The one problem with this is that any value that is non-numeric is simply replaced by `nan`, which may not be what you want.

Indigo
  • 962
  • 1
  • 8
  • 23
  • Given that [`genfromtxt` is slower than `read_csv`](http://stackoverflow.com/q/21486963/364696) and that it doesn't actually support true CSV (a delimiter of `,` is not the same thing as proper CSV which covers quoting, escapes, etc.), I'm not sure how this would help. `read_csv` does CSV correctly, and is optimized for CSV, where `genfromtxt` is wrong and general purpose (read: Likely slower than specialized code), so `genfromtxt` is the wrong way to go. – ShadowRanger Nov 30 '16 at 18:13
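
Setting aside the performance concerns in the comment above, if one did want to adapt this answer's `genfromtxt` call to the question's actual task (an element-wise average of column `I` over the n space-separated files), a hedged sketch might look like the following; `pathS` and `n` come from the question, and `names=True` is an assumption used here so that the field names are taken from the header row rather than the header being turned into `nan` values:

import numpy as np

total = 0
for k in range(n):
    ary = np.genfromtxt(pathS + 'run_TestRandom_%d.csv' % k,
                        delimiter=' ',   # the question's files are space-separated
                        names=True)      # read field names from the header row
    total += ary['I']                    # named-field access on the structured array
I = total / n                            # element-wise average of column 'I' across files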