
I was surprised to find that Python 3.5.2 is much slower than Python 2.7.12. I wrote a simple command-line one-liner that counts the lines in a huge CSV file.

$ cat huge.csv | python -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 15 seconds

$ cat huge.csv | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 66 seconds

Python 2.7.12 took 15 seconds, Python 3.5.2 took 66 seconds. I expected some difference, but why is it so huge? What's new in Python 3 that makes it so much slower at this kind of task? And is there a faster way to count lines in Python 3?

My CPU is Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz.

The size of huge.csv is 18.1 GB and it contains 101253515 lines.

To be clear: I don't need to find the number of lines of a big file at any cost; I just picked a particular case where Python 3 is much slower. I am actually developing a Python 3 script that processes big CSV files, and some of its operations don't involve the csv library. I know I could write the script in Python 2 with acceptable speed, but I would like to write a comparable script in Python 3. That's why I'm interested in what makes Python 3 slower in my example and how it can be improved with "honest" Python approaches.

Sebastian Wozny
Fomalhaut
  • https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python – Cory Kramer Nov 07 '17 at 12:32
  • 1
    @CoryKramer it doesn't explain the difference between python versions. – Fomalhaut Nov 07 '17 at 12:33
  • This doesn't answer your question, but why not use `wc` instead of Python? `cat huge.csv | wc -l`, or `wc -l huge.csv`. – mhawke Nov 07 '17 at 12:34
  • @mhawke Because I want to make it a part of my python script. – Fomalhaut Nov 07 '17 at 12:36
  • Python 3's architecture and data types are bigger/more complex, so I guess it's something quite specific. Python 2 and 3 are roughly on par, but Python 3 keeps getting performance updates, so I guess you can't assume the same task performs identically on both versions. Write it both the Python 2 and the Python 3 way, run benchmarks, and compare them. – LenglBoy Nov 07 '17 at 12:36
  • @mhawke By the way, `wc -l huge.csv` takes 15 seconds as Python 2. – Fomalhaut Nov 07 '17 at 12:37
  • @LenglBoy Thanks, but I don't intend to benchmark them. I need a fast way to process big CSV-files in Python 3 as it is in Python 2. Or to find out why it becomes impossible. – Fomalhaut Nov 07 '17 at 12:40
  • Maybe have a look here [consuming huge csv in python](https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3) – LenglBoy Nov 07 '17 at 12:43
  • You can analyse the execution by using a profiler: `echo "import sys; print(sum(1 for _ in sys.stdin))" > linecount.py` `cat huge.csv | python -m cProfile linecount.py` This will show you which function gets called how often and what time it took. But I can't explain why it's taking much longer. I also tested it and it takes about 50% longer with Python 3. – Jan Zeiseweis Nov 07 '17 at 12:48
  • @Fomalhaut: that doesn't sound right, `wc` should be _much_ faster, and it is when I test it on a 85M line file. Perhaps there is an I/O bottleneck. Also, I find Python 3 time comparable to Python 2. – mhawke Nov 07 '17 at 12:49
  • @mhawke My file is much bigger. I updated my question. Please, have a look. – Fomalhaut Nov 07 '17 at 12:52
  • @Fomalhaut: What processing do you need to do with the CSV data? Do you need to hold the large chunks of the file in memory, or can you process line-by-line? Depends entirely on your application, but since you're not just counting lines, perhaps a database would be better than a flat file. – mhawke Nov 07 '17 at 13:21
  • @mhawke I want to aggregate some data and to store the result into another file. It doesn't require to keep chunks in memory. As far as I see, the slowest place is reading from the source file. – Fomalhaut Nov 07 '17 at 13:25
  • @Fomalhaut: _Maybe_ `pandas` would be faster? Alternatively read the file once into a database. Add relevant indices. Query using aggregate functions. – mhawke Nov 07 '17 at 13:31

1 Answer


The sys.stdin object is a bit more complicated in Python 3 than it was in Python 2. For example, by default, reading from sys.stdin in Python 3 decodes the input as Unicode text, so it fails on bytes that are not valid UTF-8:

$ echo -e "\xf8" | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <genexpr>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Note that Python 2 doesn't have any problem with that input. So as you can see, Python 3's sys.stdin does more work under the hood. I'm not sure whether this is exactly what's responsible for the performance loss, but you can investigate further by trying sys.stdin.buffer under Python 3:

import sys
print(sum(1 for _ in sys.stdin.buffer))

Note that .buffer doesn't exist in Python 2. I've run some tests and I don't see a real difference in performance between Python 2's sys.stdin and Python 3's sys.stdin.buffer, but YMMV.
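If you still need decoded text rather than raw bytes, another option is to re-wrap the binary stream in a text layer with a tolerant error handler, so bad bytes like the `\xf8` above don't raise. A minimal sketch (the helper name is my own; `io.TextIOWrapper` and the `errors="replace"` handler are standard library features):

```python
import io
import sys

def tolerant_text(stream):
    """Wrap a binary stream in a text layer that substitutes U+FFFD
    for undecodable bytes instead of raising UnicodeDecodeError."""
    return io.TextIOWrapper(stream, encoding="utf-8", errors="replace")

# Usage (uncomment when reading from a real pipe):
# print(sum(1 for _ in tolerant_text(sys.stdin.buffer)))
```

This keeps the convenience of line iteration over str objects while surviving arbitrary binary input; it still pays the decoding cost, of course.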

EDIT Here are some results from my machine: Ubuntu 16.04, i7 CPU, 8 GiB RAM. First some C code (as a baseline for comparison):

#include <unistd.h>

int main(void) {
    char buffer[4096];
    size_t total = 0;
    for (;;) {
        ssize_t result = read(STDIN_FILENO, buffer, sizeof(buffer));
        if (result <= 0) {
            break;  /* EOF (0) or read error (-1) */
        }
        total += result;
    }
    (void)total;
    return 0;
}

now the file size:

$ ls -s --block-size=M | grep huge2.txt 
10898M huge2.txt

and tests:

# a.out is the compiled C baseline above (it omits the final print)
$ time cat huge2.txt | ./a.out

real    0m20.607s
user    0m0.236s
sys     0m10.600s


$ time cat huge2.txt | python -c "import sys; print(sum(1 for _ in sys.stdin))"
898773889

real    1m24.268s
user    1m20.216s
sys     0m8.724s


$ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin.buffer))"
898773889

real    1m19.734s
user    1m14.432s
sys     0m11.940s


$ time cat huge2.txt | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
898773889

real    2m0.326s
user    1m56.148s
sys     0m9.876s

So the file I used was a bit smaller and the times were longer (it seems you have a faster machine, and I didn't have the patience for larger files :D). Anyway, Python 2 and Python 3's sys.stdin.buffer are quite similar in my tests. Python 3's sys.stdin is way slower, and all of them are waaaay behind the C code (which has almost zero user time).
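If all you need is the line count (as in the question's one-liner), you can also skip line iteration entirely and count newline bytes in large binary chunks, which avoids creating one object per line. A minimal sketch (the function name and the 1 MiB chunk size are my own choices):

```python
import sys

def count_newlines(stream, chunk_size=1 << 20):
    """Count newline bytes by reading a binary stream in large chunks,
    avoiding the per-line overhead of iterating a file object."""
    total = 0
    chunk = stream.read(chunk_size)
    while chunk:
        total += chunk.count(b"\n")
        chunk = stream.read(chunk_size)
    return total

# Usage (uncomment when reading from a real pipe):
# print(count_newlines(sys.stdin.buffer))
```

Note this counts `\n` bytes rather than iterated lines, so a file whose last line lacks a trailing newline will report one fewer than the original one-liner.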

freakish
  • Thanks. I checked it takes `23 seconds`, that's much faster than without `.buffer`. But it's 1.5 times slower than Python 2. – Fomalhaut Nov 07 '17 at 13:30
  • 1
    @Fomalhaut Did you do multiple tests and took average? You read from disk and pipe it to a process. When I run this code on my machine the results vary: one time python2 is faster and second time python3. This is due to the noise generated by the OS (and other processes). But on average they were quite close (Python2 a bit faster, something like 2-3%, not much). If they are not in your case then I suppose this is how it is: in your setup Python3 is just slower. – freakish Nov 07 '17 at 13:34
  • I measured the consumed time carefully, multiple times, taking average, of course. Probably, you're right, my Python 2 setup is slower. I think 15 and 23 seconds are comparable. – Fomalhaut Nov 07 '17 at 13:38
  • @freakish That doesn't count the number of lines, that just counts the size. You'd have to for loop over the buffer and count '\n'. I'd edit your answer but it has a benchmark so I can't (As it'll affect the output). It's probably better benchmarking to include a final printf so it's actually input/output identical to the python, and so you would've noticed the logic error. – Nicholas Pipitone Jul 31 '18 at 18:38
  • @NicholasPipitone why would I want to count number of lines? OP's code doesn't do that. – freakish Jul 31 '18 at 19:22