
I have to read large .csv files of around 20 MB each. The files are tables composed of 8 columns and 5198 rows. I have to do some statistics over a specific column `I`.

I have n different files, and this is what I am doing:

import numpy as np
import pandas as pd

stat = np.arange(n)
I = 0
for k in stat:
    df = pd.read_csv(pathS + 'run_TestRandom_%d.csv' % k, sep=' ')
    I += df['I']
I = I / n  # average over the n files

This process takes 0.65 s, and I am wondering if there is a faster way.

  • maybe try to specify `memory_map=True` in `pd.read_csv` – Gosha F Nov 30 '16 at 17:11
  • If the data are exclusively numeric then there's no need to use the **csv** module; you could use **split**. There's some small overhead in using the dictionary to access record fields; you could instead use **find** on the header in the csv and then use that index to obtain items from the split record. – Bill Bell Nov 30 '16 at 17:19
  • the first row is not numeric though, is it possible to use `split`? – emax Nov 30 '16 at 17:21
  • `20MB` is not a large file. `20GB` is a large file. – furas Nov 30 '16 at 17:30
  • @furas: That was my thought. Beyond that, depending on disk fragmentation, taking 0.65 seconds to read a 20 MB file could be near the limit of the disk (last I checked, most spinning disks top out below 100 MB/s even for contiguous data, so on a cold read you'd expect at least 0.2 s just for reading, more if fragmented, ignoring all processing costs). __Edit__: Looks like desktop-class drives peak closer to 150 MB/s nowadays, with laptop drives in the 70-100 MB/s range. Even so, fragmentation can easily cut that by a factor of 10x. – ShadowRanger Nov 30 '16 at 18:07
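
A minimal sketch of the `memory_map=True` suggestion from the comments above, applied to the loop from the question. The `usecols` argument is an extra assumption (not part of the comment) that limits parsing to the one column that is actually used:

import numpy as np
import pandas as pd

# Sketch only: the question's loop, with memory_map=True and usecols added.
stat = np.arange(n)
I = 0
for k in stat:
    df = pd.read_csv(pathS + 'run_TestRandom_%d.csv' % k, sep=' ',
                     memory_map=True,   # map the file into memory instead of buffered reads
                     usecols=['I'])     # parse only the column that is needed
    I += df['I']
I = I / n  # element-wise average over the n files

And a rough sketch of the plain-`split` idea from the second comment, assuming the header row contains the column name `I`; `header.index('I')` stands in for the comment's `find` suggestion:

def read_column_I(path):
    # Illustrative only: extract column 'I' from one space-separated file without pandas.
    with open(path) as f:
        header = f.readline().split()
        idx = header.index('I')   # position of column 'I' in the header row
        return [float(line.split()[idx]) for line in f if line.strip()]

Whether either of these actually beats the original loop would have to be timed on the real files.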

1 Answer


EDIT: Apparently this is a really bad way to do it! Don't do what I did, I guess :/

I'm working on a similar problem right now with about the same-sized dataset. The method I'm using is numpy's `genfromtxt`:

import numpy as np

# Read the whole file into a structured array, skipping the header row
# and giving each of the 8 columns a name.
ary2d = np.genfromtxt('yourfile.csv', delimiter=',', skip_header=1, skip_footer=0,
                      names=['col1', 'col2', 'col3', 'col4',
                             'col5', 'col6', 'col7', 'col8'])

On my system it takes about 0.1 s in total.

The one problem with this is that any value that is non-numeric is simply replaced by `nan`, which may not be what you want.

Indigo
  • 962
  • 1
  • 8
  • 23
  • Given that [`genfromtxt` is slower than `read_csv`](http://stackoverflow.com/q/21486963/364696) and that it doesn't actually support true CSV (a delimiter of `,` is not the same thing as proper CSV which covers quoting, escapes, etc.), I'm not sure how this would help. `read_csv` does CSV correctly, and is optimized for CSV, where `genfromtxt` is wrong and general purpose (read: Likely slower than specialized code), so `genfromtxt` is the wrong way to go. – ShadowRanger Nov 30 '16 at 18:13
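
Setting aside the performance concerns in the comment above, if one did want to adapt this answer's `genfromtxt` call to the question's actual task (an element-wise average of column `I` over the n space-separated files), a hedged sketch might look like the following; `pathS` and `n` come from the question, and `names=True` is an assumption used here so that the field names are taken from the header row rather than the header being turned into `nan` values:

import numpy as np

total = 0
for k in range(n):
    ary = np.genfromtxt(pathS + 'run_TestRandom_%d.csv' % k,
                        delimiter=' ',   # the question's files are space-separated
                        names=True)      # read field names from the header row
    total += ary['I']                    # named-field access on the structured array
I = total / n                            # element-wise average of column 'I' across files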