
I'm trying to implement neural network and deep learning code in C#. The sample code in my textbook is written in Python, so I'm trying to convert it to C#.

My problem is that calculating the dot product with numpy is far faster than with my C# code written from scratch.

While my numpy code takes a few seconds to calculate the dot product 1000 times, my C# code takes much longer.

My question is: how can I make my C# code faster?

Here is the numpy code:

C:\temp>more dot.py
from datetime import datetime

import numpy as np

W = np.random.randn(784, 100)
x = np.random.randn(100, 784)

print(datetime.now().strftime("%Y/%m/%d %H:%M:%S"))

for i in range(0,1000):
    np.dot(x, W)

print(datetime.now().strftime("%Y/%m/%d %H:%M:%S"))

C:\temp>\Python35\python.exe dot.py
2017/02/08 00:49:14
2017/02/08 00:49:16
C:\temp>

And this is the C# code:

public static double[,] dot(double[,] a, double[,] b)
{
    double[,] dot = new double[a.GetLength(0), b.GetLength(1)];

    for (int i = 0; i < a.GetLength(0); i++)
    {
        for (int j = 0; j < b.GetLength(1); j++)
        {
            // the next loop looks way slow according to the profiler
            for (int k = 0; k < b.GetLength(0); k++)
                dot[i, j] += a[i, k] * b[k, j];
        }
    }
    return dot;
}

static void Main(string[] args)
{
    // compatible function with np.random.randn()
    double[,] W = random_randn(784, 100);
    double[,] x = random_randn(100, 784);

    Console.WriteLine(DateTime.Now.ToString("F"));
    for (int i = 0; i < 1000; i++)
        dot(W, x);
    Console.WriteLine(DateTime.Now.ToString("F"));
}

Regards,

snaga
  • Why are you implementing neural networks from scratch? If it's a learning exercise, then it does not matter much how fast the code runs. If it's to get stuff working well, then use already-written, high-quality software. There are many packages with neural network models, like TensorFlow, H2O, and Torch. They are all much better engineered, with more features and higher speed, than what just one person can build in C#. – Geoffrey Anderson Feb 07 '17 at 16:23
  • Right. It's just for learning both C# and deep learning, but I found that calculating dot products was much slower than I expected, and it was painful to run the examples (ported to C#) from my textbook. So I would like to improve the performance. I'm going to use existing libraries for my future production systems, for the sake of performance and a better implementation. – snaga Feb 08 '17 at 16:51

4 Answers


Numpy is extremely optimized because it uses BLAS. You will probably not get comparable performance from your own code.

The dot product is, however, very easy to parallelize. You could look into multi-threading it yourself, but to be honest it's not worth the effort. Just look for a library that implements the dot product for you and use that!
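
For what it's worth, here is a minimal multi-threaded sketch (my own illustration, with a hypothetical DotParallel name, not the asker's code) that parallelizes the outer loop of the question's dot() with Parallel.For. It won't reach BLAS speed, but it shows the idea:

using System.Threading.Tasks;

public static double[,] DotParallel(double[,] a, double[,] b)
{
    int n = a.GetLength(0), m = a.GetLength(1), p = b.GetLength(1);
    var result = new double[n, p];

    // Each parallel iteration fills a disjoint set of rows, so no locking is needed.
    Parallel.For(0, n, i =>
    {
        for (int j = 0; j < p; j++)
        {
            double sum = 0.0;
            for (int k = 0; k < m; k++)
                sum += a[i, k] * b[k, j];
            result[i, j] = sum;
        }
    });

    return result;
}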

skjerns

Your code is doing matrix multiplication. There are faster algorithms for matrix multiplication, and what you're doing is the slow naive O(n^3) approach [technically O(n*m^2), depending on the row/column lengths]. On top of that, you allocate the result memory on every call, which isn't a good idea.

Resources for you:

Incidentally, if you want state-of-the-art desktop performance for this type of thing, you might want to look into CUDA: https://en.wikipedia.org/wiki/CUDA
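
To illustrate the allocation point above, here is a small sketch (the DotInto name and out-parameter style are my own, not code from the question) that lets the caller allocate the result once and reuse it across the 1000 iterations:

public static void DotInto(double[,] a, double[,] b, double[,] result)
{
    int n = a.GetLength(0), m = a.GetLength(1), p = b.GetLength(1);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++)
        {
            double sum = 0.0;
            for (int k = 0; k < m; k++)
                sum += a[i, k] * b[k, j];
            result[i, j] = sum; // overwrites the previous value, so the buffer can be reused
        }
}

// Usage: allocate the output once, then reuse it in the benchmark loop.
// var result = new double[W.GetLength(0), x.GetLength(1)];
// for (int i = 0; i < 1000; i++)
//     DotInto(W, x, result);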

keith
  • Thanks! I'm new to matrix operations and their optimization, so I'm going to study them, and CUDA as well. – snaga Feb 08 '17 at 17:01

Make your C# code do what the Python code does: know when your language can't keep up with the big dogs, and when that happens, call out to the native code in the resident BLAS subsystem for high-performance, parallel, natively optimized matrix math ops.

The resident BLAS subsystem is wrapped by a standard API. Your C# code will call the API, but will not know -- not knowing is a good thing! -- which particular BLAS subsystem is currently installed on the host.

I like OpenBLAS. Other people like Intel MKL(?). Still others like ATLAS. I hate ATLAS.
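
As one hedged example of doing this from C#: with the Math.NET Numerics package plus one of its native provider packages installed, something along these lines should route the multiplication to native BLAS (the exact provider call, e.g. Control.UseNativeMKL(), depends on which provider package you install):

using System;
using System.Diagnostics;
using MathNet.Numerics;
using MathNet.Numerics.LinearAlgebra;

class BlasDemo
{
    static void Main()
    {
        // Try to switch to the native MKL provider (assumes an MKL provider package);
        // wrapped in try/catch in case the native binaries cannot be loaded,
        // in which case the managed provider is used.
        try { Control.UseNativeMKL(); } catch { /* managed fallback */ }

        var W = Matrix<double>.Build.Random(784, 100);
        var x = Matrix<double>.Build.Random(100, 784);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000; i++)
        {
            var product = x * W; // 100x784 times 784x100 -> 100x100, computed by the provider
        }
        sw.Stop();
        Console.WriteLine($"1000 multiplications: {sw.ElapsedMilliseconds} ms");
    }
}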

Geoffrey Anderson
  • > Know when your language can't keep up with the big dogs, and when that happens, call out to the native code in the resident BLAS subsystem for high performance parallel native optimized matrix math ops. Yeah, actually, that's exactly what I wanted to learn, and I think it's time to learn BLAS and how to call it from C#. Thanks! – snaga Feb 08 '17 at 16:53

If you need a practical solution, use existing libraries.

If you are doing this for entertainment/educational purposes:

  • Eliminate all function calls from the innermost loop (GetLength) - such calls can't be hoisted out and cause a significant slowdown. The outer loops may benefit from the same optimization, but the gain there is much smaller.

  • Transpose the second matrix first so the inner loop accesses sequential elements of both arrays.

  • Try arrays of arrays (jagged arrays, double[][]) instead of a 2D array (double[,]).

  • When using arrays of arrays, use Length as the bound of the inner loop - this may eliminate bounds checks on at least one array.

  • Parallelize the outermost loop with Parallel.For / Parallel.ForEach (these tips are combined in the sketch below).

  • If the actual problem calls for more than one multiplication of non-square matrices, see https://en.wikipedia.org/wiki/Matrix_chain_multiplication

Also, use Stopwatch to measure time - see Exact time measurement for performance testing
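
Putting several of those tips together, a rough sketch might look like the following (my own illustration, assuming the matrices are stored as jagged double[][] arrays rather than the question's double[,]):

using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class FastDot
{
    // a is n x m, b is m x p, both stored as jagged arrays.
    public static double[][] Dot(double[][] a, double[][] b)
    {
        int n = a.Length, m = b.Length, p = b[0].Length;

        // Transpose b once so the inner loop walks both arrays sequentially.
        var bT = new double[p][];
        for (int j = 0; j < p; j++)
        {
            bT[j] = new double[m];
            for (int k = 0; k < m; k++)
                bT[j][k] = b[k][j];
        }

        var result = new double[n][];
        // Parallelize the outermost loop; each iteration writes only its own row.
        Parallel.For(0, n, i =>
        {
            var row = a[i];
            var resRow = new double[p];
            for (int j = 0; j < p; j++)
            {
                var col = bT[j];
                double sum = 0.0;
                // Using row.Length as the bound helps the JIT elide bounds checks.
                for (int k = 0; k < row.Length; k++)
                    sum += row[k] * col[k];
                resRow[j] = sum;
            }
            result[i] = resRow;
        });
        return result;
    }
}

// Timing with Stopwatch instead of DateTime:
// var sw = Stopwatch.StartNew();
// for (int i = 0; i < 1000; i++) FastDot.Dot(W, x);
// sw.Stop();
// Console.WriteLine(sw.Elapsed);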

Alexei Levenkov
  • Thanks for the great tips! Yeah, I'm new to C# and neural networks/deep learning, so this is a toy project for me to learn both C# and neural network/deep learning algorithms. I'm looking for performance tips for implementing numeric algorithms in C#. I'd like to try your tips ASAP. Thanks again! – snaga Feb 08 '17 at 16:58