55

I have a 2 GB file (inputfile.txt) in which every line is a single word, for example:

apple
red
beautiful
smell
spark
input

I need to write a program to read every word in the file and print the word count. I wrote it in both Java and C++, but the result is surprising: Java runs 2.3 times faster than C++. My code is as follows:

C++:

#include <cstdio>    // printf
#include <ctime>     // clock_gettime (POSIX), timespec
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

// NANO is presumably nanoseconds per second; its definition was not shown in the original.
#define NANO 1000000000.0

int main() {
    struct timespec ts, te;
    double cost;
    clock_gettime(CLOCK_REALTIME, &ts);

    ifstream fin("inputfile.txt");
    string word;
    int count = 0;
    while (fin >> word) {
        count++;
    }
    cout << count << endl;

    clock_gettime(CLOCK_REALTIME, &te);
    cost = te.tv_sec - ts.tv_sec + (double)(te.tv_nsec - ts.tv_nsec) / NANO;
    printf("Run time: %-15.10f s\n", cost);

    return 0;
}

Output:

5e+08
Run time: 69.311 s

Java:

import java.io.BufferedReader;
import java.io.FileReader;

// Wrapper class and imports added; they were not shown in the original snippet.
public class WordCount {
    public static void main(String[] args) throws Exception {

        long startTime = System.currentTimeMillis();

        FileReader reader = new FileReader("inputfile.txt");
        BufferedReader br = new BufferedReader(reader);
        String str = null;
        int count = 0;
        while ((str = br.readLine()) != null) {
            count++;
        }
        System.out.println(count);

        long endTime = System.currentTimeMillis();
        System.out.println("Run time : " + (endTime - startTime) / 1000 + "s");
    }
}

Output:

5.0E8
Run time: 29 s

Why is Java faster than C++ in this situation, and how do I improve the performance of C++?

Peter Mortensen
dodolong
  • Run the C++ version, then the Java version, then the C++ version again. This may catch system caching. (However, 69 seconds is quite long even accounting for disk I/O.) – nneonneo Apr 09 '14 at 07:10
  • Also, have you compiled with or without optimizations (-O) turned on? – nneonneo Apr 09 '14 at 07:10
  • Did you try with plain C? Try comparing with typical C, adjusting block and buffer sizes; that would be interesting. – Niklas Rosencrantz Apr 09 '14 at 07:16
  • @nneonneo I didn't use any -O optimization. If I need it, what level should I use? Now I wonder whether the C++ stream (fin) is slower than Java's BufferedReader. – dodolong Apr 09 '14 at 07:20
  • @user3513917 C++ STL implementations usually perform poorly when used without compiler optimizations like inlining. You should never even look at the performance of a C++ program with optimizations turned off; it's meaningless. – heinrichj Apr 09 '14 at 07:24
  • You run `readLine` in Java code and `ifstream >>` in C++ code. When you use `>>`, you read data on a character basis; in other words, your disk head moves like this: (read 1 character) => (put it in `word`) => repeat. So to read the word "apple", your disk head moves 5 times carrying 1 character, while in the Java code it moves once carrying a 5-character block (virtually speaking). The trick here is that most of the time spent in I/O operations goes to exactly these disk head movements. This might not be exactly precise, but generally speaking it's true. – Constantine Samoilenko Apr 09 '14 at 07:38
  • @ConstantineSamoilenko Yes, this was a bottleneck in my C++ code. I changed it to use getline(fin, word) and -O2 optimization. Now C++ runs in 38 s and Java in 32 s. Java is still faster than C++; I don't know whether C++ can be faster than Java. – dodolong Apr 09 '14 at 08:07
  • @ConstantineSamoilenko There's no movement of the disk heads involved here; the data is buffered. (The tests needed to detect the end of a word _may_ be more complex than those needed to detect the end of a line, but otherwise, `>>` and `getline` are pretty similar. Both need to look at each character read.) – James Kanze Apr 09 '14 at 08:15
  • Your way of reading the file is unrealistic. Read the file in large chunks and time that. And you shouldn't time the console prints anyway. – this Apr 09 '14 at 08:38
  • Why do you use double for counting? 5e8 is well within int's range; it'll take less memory and run faster. – phuclv Apr 09 '14 at 08:38
  • @self. His way of reading the file is perfectly realistic; both the C++ `std::filebuf` and Java's `BufferedReader` should use buffering optimized for the given platform. If you can improve on it, then there's something wrong with the implementation. – James Kanze Apr 09 '14 at 08:48
  • @LưuVĩnhPhúc Yes, this was a mistake; I should have used an int. – dodolong Apr 09 '14 at 08:52
  • One detail: C++ doesn't automatically initialize your variable **count** to zero. – Thomas Padron-McCarthy Apr 09 '14 at 12:37
  • I am not too surprised that Java is a bit faster than C++ iostreams. The iostreams design suffers from being object oriented and using virtual functions. Unlike a Java virtual machine, C++ cannot translate virtual calls into inline calls. You get better performance by using the C stdio functions, which just read data rather than trying to support virtual data sources and virtual format operations. – Zan Lynx Apr 09 '14 at 18:39
  • Oh yeah. It is *theoretically* possible for the C++ compiler to inline virtual calls if you promise it that the binary is self contained. If using GCC try the `-fwhole-program` option when compiling. – Zan Lynx Apr 09 '14 at 18:42
  • What platform is this for? Using the native file I/O APIs will probably give a big performance boost. – paulm Apr 09 '14 at 20:10
  • Even using `fopen` and `fscanf` will give a big performance boost. – Ben Voigt Apr 09 '14 at 22:04
  • possible duplicate of [Fast textfile reading in c++](http://stackoverflow.com/questions/17925051/fast-textfile-reading-in-c) – Ben Voigt Apr 09 '14 at 22:07
  • OT: Why write a program at all? `wc` is the tool for this: `wc -w $filename`. Done. – Thomas Junk Apr 17 '14 at 14:28
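
To make the C stdio suggestion in the comments above concrete, here is a minimal sketch of a word count using plain `fopen`/`fscanf`. The file name and the 1024-byte word buffer are assumptions for illustration, not part of the original question:

#include <cstdio>

int main() {
    // Plain C stdio, as suggested in the comments: %s skips whitespace and
    // reads one whitespace-delimited word per call.
    std::FILE* f = std::fopen("inputfile.txt", "r");
    if (!f) return 1;

    char word[1024];          // assumes no word is longer than 1023 characters
    long count = 0;
    while (std::fscanf(f, "%1023s", word) == 1) {
        ++count;
    }
    std::printf("%ld\n", count);
    std::fclose(f);
    return 0;
}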

5 Answers

65

You aren't comparing the same thing. The Java program reads lines, splitting on the newline, while the C++ program reads whitespace-delimited "words", which is a little extra work.

Try istream::getline.
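
A line-counting loop along those lines, using the free function `std::getline` (the overload the asker later used in the comments), might look like this sketch; the file name is a placeholder:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream fin("inputfile.txt");
    std::string line;
    long count = 0;
    // Reads up to the next '\n', much like Java's BufferedReader.readLine().
    while (std::getline(fin, line)) {
        ++count;
    }
    std::cout << count << '\n';
}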

Later

You might also try an elementary read operation that reads into a byte array and scans it for newlines.

Even later

On my old Linux notebook, jdk1.7.0_21 and (don't-tell-me-it's-old) g++ 4.3.3 take about the same time, comparing Java with the C++ getline version. (We have established that reading words is slower.) There isn't much difference between -O0 and -O2, which doesn't surprise me, given the simplicity of the code in the loop.

Last note: As I suggested, `fin.read(buffer, LEN)` with LEN = 1 MB and using `memchr` to scan for '\n' results in another speed improvement of about 20%, which makes C (there isn't any C++ left by now) faster than Java.
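
Roughly, that final version might look like the following sketch (the buffer size and file name are assumptions, and error handling is minimal):

#include <cstring>   // std::memchr
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    const std::size_t LEN = 1 << 20;   // 1 MB buffer, as in the answer
    std::ifstream fin("inputfile.txt", std::ios::binary);
    std::vector<char> buffer(LEN);
    long count = 0;

    // Read large blocks and count '\n' bytes directly instead of extracting strings.
    while (fin.read(buffer.data(), LEN) || fin.gcount() > 0) {
        const char* p   = buffer.data();
        const char* end = p + fin.gcount();
        while ((p = static_cast<const char*>(std::memchr(p, '\n', end - p))) != nullptr) {
            ++count;
            ++p;
        }
    }
    std::cout << count << '\n';
}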

4pie0
laune
  • Why use double for counting? – laune Apr 09 '14 at 07:15
  • I used getline(fin, word); the result is 43 s. Java is still faster. – dodolong Apr 09 '14 at 07:32
  • @dodolong: Don't compare debug builds. Use optimizations. – Nawaz Apr 09 '14 at 07:42
  • On Linux, the preprocessing of a text file is a no-op, and I expect that most implementations will recognize this and simply ignore the text/binary options. On all other systems, the preprocessing necessary to convert what is read to the text format will impact performance in C++. – James Kanze Apr 09 '14 at 08:46
  • @dodolong: Who says that C++ has to be faster than Java? JIT will kick in fast enough, and there's C at the bottom of Java as well. – laune Apr 09 '14 at 08:48
  • @laune No one has said anything about pure CPU performance (which can vary---in the end, both Java and C++ will be compiled to machine code, more or less optimally). The issue here is IO. – James Kanze Apr 09 '14 at 08:50
  • Concerning your last note: it also makes the processing significantly more complicated. What is the effect if you set the buffer size on the `std::filebuf` to 1 MB? – James Kanze Apr 09 '14 at 08:53
  • @James Kanze Using a 1MB buffer with ifstream and getline reduces the time by 10%. The other 10% the "pure C" version gains should be due to avoiding the getline calls. – laune Apr 09 '14 at 10:27
  • +1 for eventually devolving into C for the sake of performance. – 2rs2ts Apr 09 '14 at 12:18
  • I'm sure using ZwCreateFile and ZwReadFile on Windows in C++ would beat Java and C due to cutting out many buffer copies. – paulm Apr 09 '14 at 20:07
  • @JamesKanze: With C++ iostreams, the performance limiter is rarely the I/O system. The facets logic is very slow. At least on common implementations. – Ben Voigt Apr 09 '14 at 22:03
  • @BenVoigt The required use of the `codecvt` facet in `std::filebuf` can slow things down considerably if not implemented correctly; it doesn't have to, but as you point out, in many common implementations, it does. Still, if the data isn't already cached by the system, it can take around 10ms to get the 8K buffer from the disk. Even a poor implementation of `std::filebuf` will not need anywhere near 10ms to remap the buffer. – James Kanze Apr 10 '14 at 08:23
  • @James: 10ms seek time, sure. But OSes use read-ahead caching, so you aren't paying 10ms per 8KB block. More like 10ms seek time + 10ms transfer time per 1MB block (assuming minimal fragmentation). And unfortunately, `std::filebuf` often does take more than 20ms/MB on popular implementations. – Ben Voigt Apr 10 '14 at 12:11
  • If you set a 1MB buffer for the C version you should do the same for the Java version (constructor parameter for `BufferedReader`). – Hauke Ingmar Schmidt Apr 16 '14 at 10:57
8

There are a number of significant differences in the way the languages handle I/O, all of which can make a difference, one way or another.

Perhaps the first (and most important) question is how the data is encoded in the text file. If it is single-byte characters (ISO 8859-1 or UTF-8), then Java has to convert it into UTF-16 before processing; depending on the locale, C++ may (or may not) also convert it or do some additional checking.

As has been pointed out (partially, at least), in C++, `>>` uses a locale-specific `isspace`, while `getline` simply compares against '\n', which is probably faster. (Typical implementations of `isspace` use a bitmap, which means an additional memory access for each character.)

Optimization levels and specific library implementations may also vary. It's not unusual in C++ for one library implementation to be 2 or 3 times faster than another.

Finally, a most significant difference: C++ distinguishes between text files and binary files. You've opened the file in text mode; this means that it will be "preprocessed" at the lowest level, before even the extraction operators see it. This depends on the platform: on Unix platforms, the "preprocessing" is a no-op; on Windows, it will convert CRLF pairs into '\n', which will have a definite impact on performance. If I recall correctly (I've not used Java for some years), Java expects higher-level functions to handle this, so functions like `readLine` will be slightly more complicated. Just guessing here, but I suspect that the additional logic at the higher level costs less in runtime than the buffer preprocessing at the lower level. (If you are testing under Windows, you might experiment with opening the file in binary mode in C++. This should make no difference in the behavior of the program when you use `>>`; any extra CR will be considered whitespace. With `getline`, you'll have to add logic to remove any trailing '\r' to your code.)
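
For reference, that binary-mode experiment on Windows could look roughly like this sketch (the file name is a placeholder; the '\r' handling is only needed because binary mode disables the CRLF translation):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Binary mode: no CRLF -> '\n' preprocessing at the lowest level.
    std::ifstream fin("inputfile.txt", std::ios::binary);
    std::string line;
    long count = 0;
    while (std::getline(fin, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();   // strip the CR that binary mode leaves in place
        ++count;
    }
    std::cout << count << '\n';
}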

Peter Mortensen
James Kanze
5

I would suspect that the main difference is that java.io.BufferedReader performs better than std::ifstream because it buffers, while the ifstream does not. The BufferedReader reads large chunks of the file in advance and hands them to your program from RAM when you call readLine(), while the std::ifstream only reads a few bytes at a time when you prompt it to by calling the `>>` operator.

Sequential access of large amounts of data from the hard drive is usually much faster than accessing many small chunks one at a time.

A fairer comparison would be to compare std::ifstream to the unbuffered java.io.FileReader.

Philipp
  • 67,764
  • 9
  • 118
  • 153
  • `std::ifstream` forwards to `std::filebuf`, which normally buffers. (The standard only requires a one-character buffer, but most implementations will use something like 8K.) – James Kanze Apr 09 '14 at 08:44
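
If you want to experiment with a larger `std::filebuf` buffer (as also discussed in the comments on the accepted answer), one way is `pubsetbuf`, which has to be called before the file is opened; whether the implementation honors the requested size is implementation-defined. A rough sketch, with the 1 MB size and file name as assumptions:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<char> buf(1 << 20);   // request a 1 MB buffer
    std::ifstream fin;
    // Must be done before open(); afterwards, the effect is implementation-defined.
    fin.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
    fin.open("inputfile.txt");

    std::string line;
    long count = 0;
    while (std::getline(fin, line))
        ++count;
    std::cout << count << '\n';
}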
4

I am not an expert in C++, but you have at least the following factors affecting performance:

  1. OS-level caching of the file
  2. For Java you are using a buffered reader, and the buffer size defaults to a page or something like that. I am not sure how C++ streams handle this.
  3. Since the file is so big, the JIT compiler probably kicks in, and it probably compiles the Java bytecode better than your C++ compiler does with no optimization turned on.

Since I/O is the major cost here, I'd guess 1 and 2 are the main reasons.

Peter Mortensen
Alex Suo
  • Given his description, I'm sure that the differing buffering strategies play a significant role. – James Kanze Apr 09 '14 at 08:16
  • Yeah, given that the program is probably I/O-bound and the JIT compiler will take care of the central loop anyway, one wouldn't expect any inherent advantage for C++. – Christian Apr 10 '14 at 00:13
2

I would also try using mmap instead of standard file read/write. This should let your OS handle the reading and writing while your application is only concerned with the data.

There's no situation where C++ can't be faster than Java, though sometimes it takes a lot of work from very talented people. I don't think this one should be too hard to beat, as it is a straightforward task.

mmap for Windows is described in File Mapping (MSDN).
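
On POSIX systems, the mmap approach could look roughly like the sketch below (the file name is a placeholder, error handling is minimal, and an empty file would need an extra check); on Windows you would use the file-mapping APIs from the MSDN link instead:

#include <cstring>      // std::memchr
#include <iostream>
#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

int main() {
    int fd = open("inputfile.txt", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // Map the whole file read-only; the OS pages it in as it is scanned.
    void* mem = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mem == MAP_FAILED) return 1;

    const char* p   = static_cast<const char*>(mem);
    const char* end = p + st.st_size;
    long count = 0;
    while ((p = static_cast<const char*>(std::memchr(p, '\n', end - p))) != nullptr) {
        ++count;
        ++p;
    }
    std::cout << count << '\n';

    munmap(mem, st.st_size);
    close(fd);
}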

Peter Mortensen
rich remer