138

I like some features of D, but I'm interested in whether they come with a runtime penalty.

To compare, I implemented a simple program that computes scalar products of many short vectors both in C++ and in D. The result is surprising:

  • D: 18.9 s [see below for final runtime]
  • C++: 3.8 s

Is C++ really almost five times as fast, or did I make a mistake in the D program?

I compiled the C++ version with g++ -O3 (gcc-snapshot 2011-02-19) and the D version with dmd -O (dmd 2.052) on a moderately recent Linux desktop. The results are reproducible over several runs, and the standard deviations are negligible.

Here is the C++ program:

#include <iostream>
#include <random>
#include <chrono>
#include <string>

#include <vector>
#include <array>

typedef std::chrono::duration<long, std::ratio<1, 1000>> millisecs;
template <typename T>  // renamed from _T: identifiers of that form are reserved to the implementation
long time_since(std::chrono::time_point<T>& time) {
  long tm = std::chrono::duration_cast<millisecs>(std::chrono::system_clock::now() - time).count();
  time = std::chrono::system_clock::now();
  return tm;
}

const long N = 20000;
const int size = 10;

typedef int value_type;
typedef long long result_type;
typedef std::vector<value_type> vector_t;
typedef typename vector_t::size_type size_type;

inline value_type scalar_product(const vector_t& x, const vector_t& y) {
  value_type res = 0;
  size_type siz = x.size();
  for (size_type i = 0; i < siz; ++i)
    res += x[i] * y[i];
  return res;
}

int main() {
  auto tm_before = std::chrono::system_clock::now();

  // 1. allocate and fill randomly many short vectors
  vector_t* xs = new vector_t [N];
  for (int i = 0; i < N; ++i) {
    xs[i] = vector_t(size);
  }
  std::cerr << "allocation: " << time_since(tm_before) << " ms" << std::endl;

  std::mt19937 rnd_engine;
  std::uniform_int_distribution<value_type> runif_gen(-1000, 1000);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < size; ++j)
      xs[i][j] = runif_gen(rnd_engine);
  std::cerr << "random generation: " << time_since(tm_before) << " ms" << std::endl;

  // 2. compute all pairwise scalar products:
  time_since(tm_before);
  result_type avg = 0;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) 
      avg += scalar_product(xs[i], xs[j]);
  avg = avg / N*N;
  auto time = time_since(tm_before);
  std::cout << "result: " << avg << std::endl;
  std::cout << "time: " << time << " ms" << std::endl;
}

And here is the D version:

import std.stdio;
import std.datetime;
import std.random;

const long N = 20000;
const int size = 10;

alias int value_type;
alias long result_type;
alias value_type[] vector_t;
alias uint size_type;

value_type scalar_product(const ref vector_t x, const ref vector_t y) {
  value_type res = 0;
  size_type siz = x.length;
  for (size_type i = 0; i < siz; ++i)
    res += x[i] * y[i];
  return res;
}

int main() {   
  auto tm_before = Clock.currTime();

  // 1. allocate and fill randomly many short vectors
  vector_t[] xs;
  xs.length = N;
  for (int i = 0; i < N; ++i) {
    xs[i].length = size;
  }
  writefln("allocation: %i ", (Clock.currTime() - tm_before));
  tm_before = Clock.currTime();

  for (int i = 0; i < N; ++i)
    for (int j = 0; j < size; ++j)
      xs[i][j] = uniform(-1000, 1000);
  writefln("random: %i ", (Clock.currTime() - tm_before));
  tm_before = Clock.currTime();

  // 2. compute all pairwise scalar products:
  result_type avg = cast(result_type) 0;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) 
      avg += scalar_product(xs[i], xs[j]);
  avg = avg / N*N;
  writefln("result: %d", avg);
  auto time = Clock.currTime() - tm_before;
  writefln("scalar products: %i ", time);

  return 0;
}
Lars
  • I'd love to know how this goes when you've re-profiled it. +1 and +Star. – Sion Sheevok Feb 28 '11 at 13:43
  • I normally take a look at http://shootout.alioth.debian.org/u32/which-programming-languages-are-fastest.php but it seems D does not appear here. – Matthieu M. Feb 28 '11 at 13:55
  • I believe the maintainer of the shootout doesn't think D is interesting enough to be included in the contest, but he encourages people to use the same scripts to create their own variations of the contest. – Vladimir Panteleev Feb 28 '11 at 14:03
  • 4
    By the way, your program has a bug on this line: `avg = avg / N*N` (order of operations). – Vladimir Panteleev Feb 28 '11 at 14:05
  • 5
    You can try to rewrite the code using array/vector operations http://www.digitalmars.com/d/2.0/arrays.html – Michal Minich Feb 28 '11 at 16:17
  • 11
    To provide a better comparison you should use the same compiler back-end. Either DMD and DMC++ or GDC and G++ – he_the_great Feb 28 '11 at 19:38
  • 2
  • @Sion Sheevok Unfortunately, dmd profiling seems not to be available for Linux? (Please correct me if I am wrong, but if I say `dmd ... trace.def` I get `error: unrecognized file extension def`.) And the dmd docs for [optlink](http://www.digitalmars.com/ctg/optlink.html) mention only Windows. – Lars Feb 28 '11 at 23:03
  • @Lars, the def files are a Windows thing and there does not seem to be an equivalent in Linux. – he_the_great Mar 01 '11 at 00:40
  • @Lars, what about the -profile command-line option? – Trass3r Mar 02 '11 at 21:32
  • @Trass3r I do not know how to use -profile on Linux. Is it possible at all? I thought not. See the comments above. – Lars Mar 02 '11 at 21:50
  • I don't know what you are trying but enabling profiling is as easy as passing -profile to dmd, on Windows and on Linux. – Trass3r Mar 03 '11 at 09:06
  • @Trass3r So far it is clear. dmd -profile writes `trace.def`. And then? I thought you have to compile a 2nd time w./o. -profile but with `trace.def` on the command line. On Linux, this throws the error message `unrecognized file extension` as described above. What else should I do to use the profiling information? – Lars Mar 03 '11 at 09:21
  • 2
    Ah, never cared about that .def file it spits out. Timings are inside the .log file. "It contains the list of the functions in the order the linker should link them" - maybe that helps optlink to optimize something? Also note that "In addition, ld fully supports the standard "*.def" files, which may be specified on the linker command line like an object file" - so you could try to pass trace.def via -L if you dearly want to. – Trass3r Mar 03 '11 at 11:58
  • Compiling with `-Ltrace.def` does not work either: `/usr/bin/ld:trace.def: file format not recognized; treating as linker script` and then `/usr/bin/ld:trace.def:3: syntax error`. I guess `trace.def` is another def file type than optlinks and that only optlink can take advantage of dmd's profiling information and optlink is not available on Linux, so one cannot optimize code via profiling with dmd on Linux ? Maybe time to start another question on How to profile D programs on Linux with dmd ? – Lars Mar 03 '11 at 22:02

8 Answers

66

To enable all optimizations and disable all safety checks, compile your D program with the following DMD flags:

-O -inline -release -noboundscheck
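
For example (assuming the source file is saved as scalar.d; the name is just for illustration):

dmd -O -inline -release -noboundscheck scalar.d
./scalar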

EDIT: I've tried your programs with g++, dmd and gdc. dmd does lag behind, but gdc achieves performance very close to g++. The command line I used was gdmd -O -release -inline (gdmd is a wrapper around gdc which accepts dmd options).

Looking at the assembler listing, it looks like neither dmd nor gdc inlined scalar_product, but g++/gdc did emit MMX instructions, so they might be auto-vectorizing the loop.
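
For reference, one way to produce such an assembler listing yourself (a sketch; the flags and filename are illustrative):

gdc -O2 -frelease -S scalar.d -o scalar.s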

Vladimir Panteleev
34

One big thing that slows D down is a subpar garbage collection implementation. Benchmarks that don't heavily stress the GC will show very similar performance to C and C++ code compiled with the same compiler backend. Benchmarks that do heavily stress the GC will show that D performs abysmally. Rest assured, though, this is a single (albeit severe) quality-of-implementation issue, not a baked-in guarantee of slowness. Also, D gives you the ability to opt out of GC and tune memory management in performance-critical bits, while still using it in the less performance-critical 95% of your code.
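
As a minimal sketch of what opting out can look like (the function name is hypothetical; GC.disable/GC.enable and the C allocator live in druntime):

import core.memory : GC;
import core.stdc.stdlib : malloc, free;

void hotSection() {
    GC.disable();             // suspend collections during the critical section
    scope(exit) GC.enable();

    // memory the GC never scans or collects
    auto buf = cast(int*) malloc(1024 * int.sizeof);
    scope(exit) free(buf);
    // ... performance-critical work on buf ...
}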

I've put some effort into improving GC performance lately and the results have been rather dramatic, at least on synthetic benchmarks. Hopefully these changes will be integrated into one of the next few releases and will mitigate the issue.

dsimcha
  • 2
    I noticed one of your changes was a change from division to bit shift. Shouldn't that be something the compiler does? – GManNickG Feb 28 '11 at 21:40
  • 4
    @GMan: Yes, if the value you're dividing by is known at compile time. No, if the value is only known at runtime, which was the case where I made that optimization. – dsimcha Feb 28 '11 at 21:46
  • 1
    @dsimcha: Hm. I figure if you know to make it, the compiler can too. Quality of implementation issue, or am I missing that some condition needs to be satisfied that the compiler cannot prove, but you know? (I'm learning D now, so these little things about the compiler are suddenly interesting to me. :)) – GManNickG Feb 28 '11 at 21:53
  • 15
    @GMan: Bit shifting only works if the number you're dividing by is a power of two. The compiler cannot prove this if the number is known only at runtime, and testing and branching would be slower than just using the div instruction. My case is unusual because the value is known only at runtime, but I know at compile time that it's going to be some power of two (see the sketch after these comments). – dsimcha Feb 28 '11 at 21:56
  • Ah, okay, "and testing and branching would be slower than just using the div instruction" slipped my mind. Thanks. :) – GManNickG Feb 28 '11 at 22:05
  • 8
    Note that the program posted in this example doesn't do allocation in the time-consuming portion. – Vladimir Panteleev Feb 28 '11 at 23:25
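
A minimal sketch of the trick dsimcha describes (assuming a divisor that is only known at run time but is guaranteed to be a power of two; bsf is in core.bitop):

import core.bitop : bsf;

// Divide by a runtime power of two using a shift.
// For a power of two, bsf (index of the lowest set bit) equals log2 of the divisor.
size_t divByPow2(size_t x, size_t d)
in
{
    assert(d != 0 && (d & (d - 1)) == 0); // d must be a power of two
}
body
{
    return x >> bsf(d);
}
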
30

This is a very instructive thread - thanks to the OP and the helpers for all the work.

One note - this test is not assessing the general question of abstraction/feature penalty, or even that of backend quality. It focuses on essentially one optimization (loop optimization). I think it's fair to say that gcc's backend is somewhat more refined than dmd's, but it would be a mistake to assume that the gap between them is as large for all tasks.

Andrei Alexandrescu
  • 5
    I completely agree. As added later on, I am mainly interested in performance for numerical computations, where loop optimization probably is the most important one. Which other optimizations do you think would be important for numerical computing? And which computations would test them? I would be interested to complement my test and implement some more tests (if they are roughly as simple). But perhaps this is another question in its own right? – Lars Mar 02 '11 at 21:41
  • 16
    As an engineer that cut his teeth on C++, you are a hero of mine. Respectfully, however, this should be a comment, not an answer. – Alan Mar 29 '14 at 07:18
20

Definitely seems like a quality-of-implementation issue.

I ran some tests with the OP's code and made some changes. With LDC for D and clang++ for C++, I actually got D running faster, operating on the assumption that the arrays must be allocated dynamically (xs and the associated scalars). See below for some numbers.

Questions for the OP

Is it intentional that the same seed be used for each iteration of C++, while not so for D?

Setup

I have tweaked the original D source (dubbed scalar.d) to make it portable between platforms. This only involved changing the type of the numbers used to access and modify the size of arrays.

After this, I made the following changes:

  • Used uninitializedArray to avoid default initialization of the scalars in xs (this probably made the biggest difference). This matters because D normally default-initializes everything silently, which C++ does not.

  • Factored out printing code and replaced writefln with writeln

  • Changed imports to be selective
  • Used pow operator (^^) instead of manual multiplication for final step of calculating average
  • Removed size_type, replacing it where appropriate with the new index_type alias

...thus resulting in scalar2.d (pastebin):

    import std.stdio : writeln;
    import std.datetime : Clock, Duration;
    import std.array : uninitializedArray;
    import std.random : uniform;

    alias result_type = long;
    alias value_type = int;
    alias vector_t = value_type[];
    alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint

    immutable long N = 20000;
    immutable int size = 10;

    // Note: loops here still use explicit for; scalar3.d below replaces them with foreach
    value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
      value_type res = 0;
      for(index_type i = 0; i < size; ++i)
        res += x[i] * y[i];
      return res;
    }

    int main() {
      auto tm_before = Clock.currTime;
      auto countElapsed(in string taskName) { // Factor out printing code
        writeln(taskName, ": ", Clock.currTime - tm_before);
        tm_before = Clock.currTime;
      }

      // 1. allocate and fill randomly many short vectors
      vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
      for(index_type i = 0; i < N; ++i)
        xs[i] = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
      countElapsed("allocation");

      for(index_type i = 0; i < N; ++i)
        for(index_type j = 0; j < size; ++j)
          xs[i][j] = uniform(-1000, 1000);
      countElapsed("random");

      // 2. compute all pairwise scalar products:
      result_type avg = 0;
      for(index_type i = 0; i < N; ++i)
        for(index_type j = 0; j < N; ++j)
          avg += scalar_product(xs[i], xs[j]);
      avg /= N ^^ 2;// Replace manual multiplication with pow operator
      writeln("result: ", avg);
      countElapsed("scalar products");

      return 0;
    }

After testing scalar2.d (which prioritized optimization for speed), out of curiosity I replaced the loops in main with foreach equivalents, and called it scalar3.d (pastebin):

    import std.stdio : writeln;
    import std.datetime : Clock, Duration;
    import std.array : uninitializedArray;
    import std.random : uniform;

    alias result_type = long;
    alias value_type = int;
    alias vector_t = value_type[];
    alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint

    immutable long N = 20000;
    immutable int size = 10;

    // Replaced for loops with appropriate foreach versions
    value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
      value_type res = 0;
      for(index_type i = 0; i < size; ++i)
        res += x[i] * y[i];
      return res;
    }

    int main() {
      auto tm_before = Clock.currTime;
      auto countElapsed(in string taskName) { // Factor out printing code
        writeln(taskName, ": ", Clock.currTime - tm_before);
        tm_before = Clock.currTime;
      }

      // 1. allocate and fill randomly many short vectors
      vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
      foreach(ref x; xs)
        x = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
      countElapsed("allocation");

      foreach(ref x; xs)
        foreach(ref val; x)
          val = uniform(-1000, 1000);
      countElapsed("random");

      // 2. compute all pairwise scalar products:
      result_type avg = 0;
      foreach(const ref x; xs)
        foreach(const ref y; xs)
          avg += scalar_product(x, y);
      avg /= N ^^ 2;// Replace manual multiplication with pow operator
      writeln("result: ", avg);
      countElapsed("scalar products");

      return 0;
    }

I compiled each of these tests using an LLVM-based compiler, since LDC seems to be the best option for D compilation in terms of performance. On my x86_64 Arch Linux installation I used the following packages:

  • clang 3.6.0-3
  • ldc 1:0.15.1-4
  • dtools 2.067.0-2

I used the following commands to compile each:

  • C++: clang++ scalar.cpp -o"scalar.cpp.exe" -std=c++11 -O3
  • D: rdmd --compiler=ldc2 -O3 -boundscheck=off <sourcefile>

Results

The results (screenshot of raw console output) for each version of the source are as follows:

  1. scalar.cpp (original C++):

    allocation: 2 ms
    
    random generation: 12 ms
    
    result: 29248300000
    
    time: 2582 ms
    

    C++ sets the standard at 2582 ms.

  2. scalar.d (modified OP source):

    allocation: 5 ms, 293 μs, and 5 hnsecs 
    
    random: 10 ms, 866 μs, and 4 hnsecs 
    
    result: 53237080000
    
    scalar products: 2 secs, 956 ms, 513 μs, and 7 hnsecs 
    

    This ran for ~2957 ms. Slower than the C++ implementation, but not by much.

  3. scalar2.d (index/length type change and uninitializedArray optimization):

    allocation: 2 ms, 464 μs, and 2 hnsecs
    
    random: 5 ms, 792 μs, and 6 hnsecs
    
    result: 59
    
    scalar products: 1 sec, 859 ms, 942 μs, and 9 hnsecs
    

    In other words, ~1860 ms. So far this is in the lead.

  4. scalar3.d (foreaches):

    allocation: 2 ms, 911 μs, and 3 hnsecs
    
    random: 7 ms, 567 μs, and 8 hnsecs
    
    result: 189
    
    scalar products: 2 secs, 182 ms, and 366 μs
    

    ~2182 ms is slower than scalar2.d, but faster than the C++ version.

Conclusion

With the correct optimizations, the D implementation actually ran faster than its equivalent C++ implementation using the LLVM-based compilers available. The current gap between D and C++ for most applications seems to be based only on the limitations of current implementations.

Erich Gubler
10

dmd is the reference implementation of the language and thus most work is put into the frontend to fix bugs rather than optimizing the backend.

"in" is faster in your case cause you are using dynamic arrays which are reference types. With ref you introduce another level of indirection (which is normally used to alter the array itself and not only the contents).

Vectors are usually implemented with structs where const ref makes perfect sense. See smallptD vs. smallpt for a real-world example featuring loads of vector operations and randomness.

Note that 64-bit can also make a difference. I once missed that on x64 gcc compiles 64-bit code while dmd still defaults to 32-bit (this will change when the 64-bit codegen matures). There was a remarkable speedup with "dmd -m64 ...".

Trass3r
8

Whether C++ or D is faster is likely to be highly dependent on what you're doing. I would think that when comparing well-written C++ to well-written D code, they would generally either be of similar speed, or C++ would be faster, but what the particular compiler manages to optimize could have a big effect completely aside from the language itself.

However, there are a few cases where D stands a good chance of beating C++ for speed. The main one which comes to mind is string processing. Thanks to D's array slicing capabilities, strings (and arrays in general) can be processed much faster than you can readily do in C++. For D1, Tango's XML processor is extremely fast, thanks primarily to D's array slicing capabilities (and hopefully D2 will have a similarly fast XML parser once the one currently being worked on for Phobos is completed).
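
To illustrate the slicing point (a minimal sketch; the variable names are made up, indexOf is from std.string): a slice is just a view into the original array, so splitting a string allocates and copies nothing:

import std.string : indexOf;

void splitExample() {
    string line = "key=value";
    auto eq = line.indexOf('=');
    string key = line[0 .. eq];      // no copy, just a view
    string val = line[eq + 1 .. $];  // no copy either
}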

Now, I am surprised that you're seeing such a difference in speed in this particular case, but it is the sort of thing that I would expect to improve as dmd improves. Using gdc might yield better results, and would likely be a closer comparison of the language itself (rather than the backend), given that it's gcc-based. But it wouldn't surprise me at all if there were a number of things that could be done to speed up the code that dmd generates. I don't think that there's much question that gcc is more mature than dmd at this point, and code optimizations are one of the prime fruits of code maturity.

Ultimately, what matters is how well dmd performs for your particular application, but I do agree that it would definitely be nice to know how well C++ and D compare in general. In theory, they should be pretty much the same, but it really depends on the implementation. I think that a comprehensive set of benchmarks would be required to really test how well the two presently compare, however.

Jonathan M Davis
  • 4
    Yes I would be surprised if input/output was significantly faster in either language, or if pure math were significantly faster in either language, but string operations, memory management, and a few other things could easily let one language shine. – Max Lybbert Feb 28 '11 at 18:25
  • 1
    It's easy to do better (faster) than C++ iostreams. But that's primarily a library implementation issue (on all known versions from the most popular vendors). – Ben Voigt Feb 28 '11 at 23:41
5

You can write C code in D, so as far as which is faster, it will depend on a lot of things:

  • What compiler you use
  • What features you use
  • How aggressively you optimize

Differences in the first aren't fair to drag in. The second might give C++ an advantage as it, if anything, has fewer heavyweight features. The third is the fun one: D code is in some ways easier to optimize because in general it is easier to understand. It also supports a large degree of generative programming, allowing verbose and repetitive but fast code to be written in shorter forms, as sketched below.
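
A minimal sketch of that generative angle (the function names are made up; the string is built at compile time via CTFE and spliced in with a string mixin):

import std.conv : to;

// Build the source of a fully unrolled dot-product expression at compile time.
string unrolledDot(size_t n) {
    string s = "0";
    foreach (i; 0 .. n)
        s ~= " + x[" ~ to!string(i) ~ "] * y[" ~ to!string(i) ~ "]";
    return s;
}

// Assumes x and y have length >= 10.
int dot10(const int[] x, const int[] y) {
    return mixin(unrolledDot(10)); // x[0]*y[0] + x[1]*y[1] + ... + x[9]*y[9]
}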

BCS
4

Seems like a quality of implementation issue. For example, here's what I've been testing with:

import std.datetime, std.stdio, std.random;

version = ManualInline;

immutable N = 20000;
immutable Size = 10;

alias int value_type;
alias long result_type;
alias value_type[] vector_type;

result_type scalar_product(in vector_type x, in vector_type y)
in
{
    assert(x.length == y.length);
}
body
{
    result_type result = 0;

    foreach(i; 0 .. x.length)
        result += x[i] * y[i];

    return result;
}

void main()
{   
    auto startTime = Clock.currTime();

    // 1. allocate vectors
    vector_type[] vectors = new vector_type[N];
    foreach(ref vec; vectors)
        vec = new value_type[Size];

    auto time = Clock.currTime() - startTime;
    writefln("allocation: %s ", time);
    startTime = Clock.currTime();

    // 2. randomize vectors
    foreach(ref vec; vectors)
        foreach(ref e; vec)
            e = uniform(-1000, 1000);

    time = Clock.currTime() - startTime;
    writefln("random: %s ", time);
    startTime = Clock.currTime();

    // 3. compute all pairwise scalar products
    result_type avg = 0;

    foreach(vecA; vectors)
        foreach(vecB; vectors)
        {
            version(ManualInline)
            {
                result_type result = 0;

                foreach(i; 0 .. vecA.length)
                    result += vecA[i] * vecB[i];

                avg += result;
            }
            else
            {
                avg += scalar_product(vecA, vecB);
            }
        }

    avg = avg / (N * N);

    time = Clock.currTime() - startTime;
    writefln("scalar products: %s ", time);
    writefln("result: %s", avg);
}

With ManualInline defined I get 28 seconds, but without it I get 32. So the compiler isn't even inlining this simple function, which I think it's clear it should be.

(My command line is dmd -O -noboundscheck -inline -release ....)

GManNickG
  • 1
    Your timings are meaningless unless you give the comparison to your C++ timings also. – hiddensunset4 Jun 22 '11 at 06:59
  • 3
    @Daniel: You're missing the point. It was to demonstrate the D optimizations in isolation, namely for the conclusion I stated: "So the compiler isn't even inlining this simple function, which I think it's clear it should be." I'm not even attempting to compare it to C++, as I clearly stated in the *first* sentence: "Seems like a quality of implementation issue." – GManNickG Jun 22 '11 at 08:01
  • Ah true, sorry :). You'll also find the DMD compiler doesn't vectorise the loops at all either. – hiddensunset4 Jun 22 '11 at 08:38