
I have the following Go code:

package main

import ("fmt"
        "os"
        "bufio")

func main() {
    reader := bufio.NewReader(os.Stdin)
    scanner := bufio.NewScanner(reader)

    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
}

and the following Python code:

import sys

for ln in sys.stdin:
    print ln,

Both simply read lines from standard input and print them to standard output. The Python version takes only about a quarter of the time the Go version needs (tested on a 16 million line text file, with output redirected to /dev/null). Why is that?

UPDATE: Following JimB and siritinga's advice, I changed Go's output to a buffered version. Now the Go version is much faster, but still about 75% slower than the Python version.

package main

import ("os"
        "bufio")

func main() {
    reader := bufio.NewReader(os.Stdin)
    scanner := bufio.NewScanner(reader)
    writer := bufio.NewWriter(os.Stdout)
    defer writer.Flush() // flush any remaining buffered output before main returns

    for scanner.Scan() {
        writer.WriteString(scanner.Text()+"\n")
    }
}
Gatozee
  • Thanks! I read the C++ v Python question. However, Go's bufio does do buffering. So that doesn't answer my question. – Gatozee Jan 15 '15 at 16:27
  • I recommend [editing your question](http://stackoverflow.com/posts/27967765/edit) to explain that you've read the other question (please link to it explicitly) and discuss why your question is different. – Air Jan 15 '15 at 16:29
  • Scanner also implements buffering, so you can simply use the Scanner on its own, but you should also buffer the output. – siritinga Jan 15 '15 at 16:45
  • @Gatozee: Python io is all buffered. You need to buffer stdout too. – JimB Jan 15 '15 at 16:46
  • @JimB, you are right. Replacing fmt.Println with bufio's Writer.WriteString improves the performance. – Gatozee Jan 15 '15 at 18:17
  • and if you stop allocating a new string for every line, you can beat python's performance considerably. – JimB Jan 15 '15 at 18:47
  • [the question](http://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python) is not a duplicate, though they are closely related. Go and C++ are different languages, and there could be recommendations for improving Go I/O performance that do not apply to C++. – jfs Jan 15 '15 at 22:01
  • please, do not edit your question in place. Does the `1/4` result apply to the `writer.WriteString()` or the `fmt.Println` variant? If you think you found a satisfactory solution, then post it as an answer (don't put it in the question) – jfs Jan 15 '15 at 22:05
  • I would strongly recommend rephrasing the question so that you show the unoptimized version first, followed by the optimized version placed in an answer, as J.F. Sebastian suggests. By editing in place and removing the unoptimized version, it becomes very difficult to understand the context or the original situation. Without knowing the history, the question as currently stated makes no sense. – dyoo Jan 15 '15 at 22:05
  • I restored the original. – Simon Whitehead Jan 15 '15 at 22:09
  • Sorry, guys. I'm new here. Thanks for the suggestion! – Gatozee Jan 16 '15 at 17:58

1 Answer


As JimB said, stop using strings. Python 2.x strings are just raw bytes, whereas a Go string is an immutable copy of the data (conventionally UTF-8), so scanner.Text() has to allocate and copy a new string for every line, and the scanner.Text()+"\n" concatenation allocates yet again. On the other hand, you also get more features out of strings.
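
To see the allocation cost in isolation, here is a minimal benchmark sketch (illustrative only; the file name scan_bench_test.go and the synthetic input are my own and are not part of the measurements below) comparing scanner.Text() with scanner.Bytes(). Run it with `go test -bench=. -benchmem`:

package main

import (
    "bufio"
    "strings"
    "testing"
)

// Synthetic input: many short lines, roughly like a word list.
var input = strings.Repeat("a reasonably ordinary line of text\n", 10000)

// BenchmarkScanText calls Text(), which copies every line into a new string.
func BenchmarkScanText(b *testing.B) {
    for i := 0; i < b.N; i++ {
        scanner := bufio.NewScanner(strings.NewReader(input))
        for scanner.Scan() {
            _ = scanner.Text()
        }
    }
}

// BenchmarkScanBytes calls Bytes(), which returns a slice into the scanner's
// internal buffer and therefore does not allocate a copy per line.
func BenchmarkScanBytes(b *testing.B) {
    for i := 0; i < b.N; i++ {
        scanner := bufio.NewScanner(strings.NewReader(input))
        for scanner.Scan() {
            _ = scanner.Bytes()
        }
    }
}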

If you switch your Python implementation to Unicode strings (upgrade to 3.x, or use unicode strings in 2.x), its performance will tank. Conversely, if you make the Go version work with raw byte slices, as Python 2.x effectively does, you will get much better performance:

package main

import ("os"
        "bufio")

func main() {
    reader := bufio.NewReader(os.Stdin)
    scanner := bufio.NewScanner(reader)
    writer := bufio.NewWriter(os.Stdout)
    defer writer.Flush() // flush any remaining buffered output before main returns
    newline := []byte("\n")

    for scanner.Scan() {
        writer.Write(scanner.Bytes())
        writer.Write(newline)
    }
}

On my system, using a word list with 65 million lines, Python:

real    0m12.724s
user    0m12.581s
sys     0m0.145s

And the Go version:

real    0m4.408s
user    0m4.276s
sys     0m0.135s

It should also be noted that, as performance comparisons go, this is not a good test case: it does not represent what a real application would do, which is to actually process the data somehow.

user918176
  • That works great! In reality, I need to process each line in different ways; I just want to improve the program one piece at a time. On the other hand, with raw byte slices I can't use the functions in package 'strings', like Split, Trim, etc. – Gatozee Jan 16 '15 at 19:36
  • @Gatozee Well, if you've got a lot of data coming in, you should definitely also study channels and goroutines. For instance, in this example the reading and writing still partially wait for each other, so the optimization shown here is not yet the fastest possible. – user918176 Jan 16 '15 at 19:38
  • Just found out that package 'bytes' has equivalents of the 'strings' functions for byte slices (see the sketch below). Great! – Gatozee Jan 20 '15 at 14:01
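
For reference, here is a minimal sketch (illustrative only, not part of the timed runs above) of a scan loop that stays in byte slices and uses the bytes package for simple per-line work: it splits each line into whitespace-separated fields and re-joins them with single spaces, as a stand-in for real processing.

package main

import (
    "bufio"
    "bytes"
    "os"
)

func main() {
    scanner := bufio.NewScanner(os.Stdin)
    writer := bufio.NewWriter(os.Stdout)
    defer writer.Flush() // flush any remaining buffered output before main returns

    for scanner.Scan() {
        // bytes.Fields splits the line on whitespace without
        // converting it to a string first.
        fields := bytes.Fields(scanner.Bytes())
        for i, f := range fields {
            if i > 0 {
                writer.WriteByte(' ')
            }
            writer.Write(f)
        }
        writer.WriteByte('\n')
    }
}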