
I am trying to perform very simple computations on a huge file, like counting the number of labels for some columns or the average and standard deviation for other columns. The file is too big to fit in memory and I am currently processing it line by line with:

import csv

unique = {key: [] for key in categorical_keys}
means = {key: 0.0 for key in numerical_keys}
sds = {key: 0.0 for key in numerical_keys}
with open('input/train.csv', 'r') as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    for i, row in enumerate(reader):
        for key, value in row.iteritems():
            if key in categorical_keys:
                if row[key] not in unique[key]:
                    unique[key].extend([value])
            elif key in numerical_keys:
                if value:
                    means[key] = (means[key]*i + float(value))/(i+1)
                    if i > 1:
                        sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i

Now this seems to be too slow and I am wondering if it would be faster to process it in chunks that could fit in memory. Would it be faster? If yes, why?

Thanks for your help.

Romain
  • File I/O is already buffered. Have you looked into options to mess with the buffer? And by "seems too slow" do you mean that you ran this script and it took an unacceptably long time to process, or that you're looking at your code and you have a hunch that there's a faster way to do it? – TigerhawkT3 May 04 '16 at 23:14
  • Python has its own performance limitation after all. If there's nothing you can optimize in your parsing, maybe consider writing this in C/C++? – yelsayed May 04 '16 at 23:17
  • 1/ I haven't looked into buffers; any more precision on that? 2/ It takes a long time to run, but it is still manageable. I can always try to parallelize and see how much it reduces the time. But somehow I feel like such simple computations shouldn't take that long. – Romain May 04 '16 at 23:22
  • 1
    What some people do is just try different things and hope for improvement. You can do [*this*](http://stackoverflow.com/a/4299378/23771), which takes little time, and it's likely to surprise you about where the time actually goes. If you fix what it tells you, you stand a much better chance of speeding up the code. – Mike Dunlavey May 05 '16 at 00:08

2 Answers


Loop optimizations

If you need to gain some speed:

  • be sure you really need the speedup (otherwise you will spend too much time on a worthless task)
  • start with loops
    • check if some loops can be avoided
    • optimize/remove instructions inside the loop
      • each instruction counts
      • each reference counts

Here is my draft of optimized code (not tested):

import csv
from collections import defaultdict


unique = defaultdict(set)
means = {key: 0.0 for key in numerical_keys}
sds = {key: 0.0 for key in numerical_keys}
with open('input/train.csv', 'r') as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    for i, row in enumerate(reader):
        for key in categorical_keys:
            unique[key].add(row[key])
        for key in numerical_keys:
            try:
                # float() raises ValueError for an empty or non-numeric string
                value = float(row[key])
                mean_val = (means[key]*i + value)/(i+1)
                means[key] = mean_val
                # the division by i raises ZeroDivisionError for i == 0
                sds[key] = (sds[key]*(i-1) + (value-mean_val)**2)/i
            except (ValueError, ZeroDivisionError):
                pass

Collecting unique values

You use a dict with a list of unique values per key:

unique = {key: [] for key in categorical_keys}

and add unique values to it as list items (this happens inside the loop):

if key in categorical_keys:
    if row[key] not in unique[key]:
        unique[key].extend([value])

You can save the membership test (checking whether the value already exists in the list) by adding the value directly into a set - the set takes care that only unique values are collected.

Using defaultdict ensures that an empty set is already present for any key that has not been used yet.
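
For illustration, here is a minimal, self-contained sketch of the set-based approach (the key and values are made up):

from collections import defaultdict

unique = defaultdict(set)

# each add is a single O(1) call; duplicates are simply ignored
unique['color'].add('red')
unique['color'].add('red')    # no effect, 'red' is already present
unique['color'].add('blue')

print(unique['color'])        # {'red', 'blue'}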

Do not test record keys for type in each loop - know them in advance

Your code repeatedly loops over the record keys, tests each for its type, and then does something:

        if key in categorical_keys:
            if row[key] not in unique[key]:
                unique[key].extend([value])
        elif key in numerical_keys:
            if value:
                means[key] = (means[key]*i + float(value))/(i+1)
                if i > 1:
                    sds[key] = (sds[key]*(i-1) + (float(value)-means[key])**2)/i

You can save these tests if your categorical_keys and numerical_keys are properly set to existing column names. Then you can loop directly over the known key names:

        for key in categorical_keys:
            unique[key].add(row[key])
        for key in numerical_keys:
            try:
                # float() raises ValueError if None or empty string
                value = float(row[key])
                means[key] = (means[key]*i + value)/(i+1)
                if i > 1:
                    sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
            except ValueError:
                pass

Reuse a value calculated once

Your code repeatedly calculates the value:

float(value)

Do it once and reuse the result.

Similarly, the new mean is calculated once and stored into means[key], and two lines later the value is looked up again. It is better to store it in a local variable and use that twice; every lookup (like means[key]) costs something.
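
A small sketch of the difference, with made-up toy values so it can run on its own:

means = {'price': 10.0}
sds = {'price': 2.0}
key, i, value = 'price', 4, 12.0

# variant 1: means[key] is written and then looked up again on the next line
means[key] = (means[key]*i + value)/(i+1)
sds[key] = (sds[key]*(i-1) + (value - means[key])**2)/i

# variant 2: the new mean is kept in a local variable, saving one dict lookup
means = {'price': 10.0}
sds = {'price': 2.0}
mean_val = (means[key]*i + value)/(i+1)
means[key] = mean_val
sds[key] = (sds[key]*(i-1) + (value - mean_val)**2)/i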

Catching an exception is usually faster than a value test

Your code tests for value not being empty:

        elif key in numerical_keys:
            if value:
                # something here

You can replace it with code that works with the value directly. If the value is wrong, the conversion fails and the ValueError exception is caught and ignored. If most values are set, this speeds things up.

            try:
                value=float(value)
                means[key] = (means[key]*i + value)/(i+1)
                if i > 1:
                    sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
            except ValueError:
                pass

Can you avoid the if i > 1: test?

This condition is true in most cases, but you test it in every iteration. If you find a way to avoid the test, that makes it faster too.

As you proposed, we can resolve it by catching the ZeroDivisionError raised when i == 0 (note that, unlike the if i > 1 guard, the update then also runs for i == 1):

            try:
                # float() raises ValueError for an empty or non-numeric string
                value = float(value)
                means[key] = (means[key]*i + value)/(i+1)
                # the division by i raises ZeroDivisionError for i == 0
                sds[key] = (sds[key]*(i-1) + (value-means[key])**2)/i
            except (ValueError, ZeroDivisionError):
                pass

Processing data in chunks

Regarding processing in chunks:

  • it definitely adds some complexity
  • it might speed the program up, if done carefully
  • it might slow it down or provide wrong results

Where chunking can improve the speed

Reading files in larger pieces

This sounds obvious, but the libraries already take care of it. Expect minor or no improvement.
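
If you want to experiment anyway, open() accepts a buffering argument; a quick sketch (the 1 MiB size is an arbitrary choice, not a recommendation):

import csv

# read through a 1 MiB buffer instead of the default one
with open('input/train.csv', 'r', buffering=1024*1024) as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    for row in reader:
        pass  # process the row as before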

Getting CSV records in chunks

I am not aware of any method of csv.reader or csv.DictReader that returns a chunk of records directly; you would have to do it yourself. This is possible, and I would recommend using itertools.groupby.

Do not expect a speedup from this on its own (it actually slows things down a bit), but it is a prerequisite for the other chunk-based speedups below.
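
One possible sketch (not tested) of such a chunker built on itertools.groupby; the chunk size of 10000 is an arbitrary assumption:

import csv
from itertools import groupby


def record_chunks(records, chunk_size=10000):
    # group consecutive records by record_index // chunk_size
    for _, group in groupby(enumerate(records), key=lambda pair: pair[0] // chunk_size):
        yield [row for _, row in group]


with open('input/train.csv', 'r') as read_file:
    reader = csv.DictReader(read_file, delimiter=',', quotechar='|')
    for chunk in record_chunks(reader):
        pass  # chunk is a list of up to 10000 row dicts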

Adding a chunk of values to a set

The code currently adds values to a set one by one. If you add a whole chunk of values at once (the larger, the better), it will be faster, as each Python call has some small overhead.
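
For example, set.update takes a whole iterable in one call; a small sketch with made-up values:

unique = set()

# one value at a time: one method call per value
for value in ['a', 'b', 'a', 'c']:
    unique.add(value)

# a whole chunk at once: a single call, duplicates still ignored
chunk_values = ['a', 'b', 'c', 'd', 'd']
unique.update(chunk_values)

print(unique)  # {'a', 'b', 'c', 'd'}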

Calculating mean and sds values

You could take advantage of the statistics package, which might contain optimized code (although it seems to be pure Python anyway).

Anyway, as you are going to process the data in chunks, a plain statistics.mean will not work for you as-is; you would have to aggregate the per-chunk results together somehow (if that is even possible).

If you calculate the values yourself, careful coding can give you some speedup, mostly because you get the values in one chunk and do not have to dereference them value by value in each iteration.
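
One way to aggregate across chunks is to keep running sums and derive the mean and (population) standard deviation at the end; a sketch under that assumption, with two made-up chunks:

import math

total, total_sq, count = 0.0, 0.0, 0

for chunk in ([1.0, 2.0, 3.0], [4.0, 5.0]):   # stand-in for real chunks of one column
    total += sum(chunk)
    total_sq += sum(v*v for v in chunk)
    count += len(chunk)

mean = total / count
sd = math.sqrt(total_sq / count - mean*mean)  # population standard deviation
print(mean, sd)                               # 3.0 1.414...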

Chunking conclusions

To me, the chunking optimization seems too complex, and it is hard to predict whether it would bring any benefit.

Jan Vlcinsky
  • Thanks for this great post. 5 times faster!!! Could you change "value" to "row[key]" and "if i > 1 ..." to "try ... except ZeroDivisionError", please? Also, would there be any advantage in processing the data in chunks vs per line? – Romain May 05 '16 at 01:08
  • @Romain Glad to hear about the speedup. I edited the code as you have proposed. Hope it is done properly. I will add answer to chunks into my answer. – Jan Vlcinsky May 05 '16 at 09:57
  • @Romain Added answer to chunking. – Jan Vlcinsky May 05 '16 at 10:46

Jan's answer is great, but if you want to push the speed some more, you can investigate the following yourself:

Profile the execution line by line: if you don't know what's slow, you can't improve it further (https://github.com/rkern/line_profiler)
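
A rough sketch of how line_profiler is typically used (assuming it is installed, e.g. via pip install line_profiler); the script name and function are made up:

# profile_me.py
import csv

@profile                     # this decorator is injected by kernprof at runtime
def process_file(path):
    with open(path, 'r') as read_file:
        for row in csv.DictReader(read_file):
            pass             # per-line timings of each statement get reported

process_file('input/train.csv')

Run it with kernprof -l -v profile_me.py to get the per-line timing report.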

Since most of your code performs numeric operations, you may be wasting a lot of time on type-checking overhead. There are two paths here - either annotate the code with types and use Cython, or leave the code as it is and run it under PyPy - both should give you a good boost.

viraptor