I was surprised to find that Python 3.5.2 is much slower than Python 2.7.12. I wrote a simple command-line one-liner that counts the number of lines in a huge CSV file.
$ cat huge.csv | python -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 15 seconds
$ cat huge.csv | python3 -c "import sys; print(sum(1 for _ in sys.stdin))"
101253515
# it took 66 seconds
Python 2.7.12 took 15 seconds, Python 3.5.2 took 66 seconds. I expected some difference, but why is it so huge? What changed in Python 3 that makes it so much slower for this kind of task? Is there a faster way to count the lines in Python 3?
My CPU is an Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz.
The size of huge.csv is 18.1 GB and it contains 101253515 lines.
To be clear, I don't need to count the lines of a big file at any cost; this is just one particular case where Python 3 is much slower. I am actually developing a Python 3 script that deals with big CSV files, and some of its operations don't involve the csv library. I know I could write the script in Python 2 and the speed would be acceptable, but I would like to know how to write a similar script in Python 3. That is why I am interested in what makes Python 3 slower in my example and how it can be improved with "honest" Python approaches.
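For comparison, here is a minimal sketch of a variant that might be worth timing. It rests on an assumption I have not measured: that much of the extra time goes into Python 3 decoding stdin into str objects, since sys.stdin is a text stream there while sys.stdin.buffer exposes the raw bytes. The names count_lines and chunk_size are hypothetical, chosen only for illustration.

```python
import sys


def count_lines(stream, chunk_size=1 << 20):
    """Count newline bytes in a binary stream, reading fixed-size chunks."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of input
            break
        total += chunk.count(b"\n")
    return total


def main():
    # sys.stdin.buffer is the underlying binary stream, so no decoding happens.
    print(count_lines(sys.stdin.buffer))
```

Saved as a script with a call to main() at the bottom, it would be used the same way as the one-liners: cat huge.csv | python3 script.py. Note that it counts newline bytes, so the result is one lower than the line-iteration count when the last line has no trailing newline.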