
I have some questions about Just-In-Time (JIT) compilation with CUDA.

I have implemented a library based on Expression Templates according to the paper

J.M. Cohen, "Processing Device Arrays with C++ Metaprogramming", GPU Computing Gems - Jade Edition

It seems to work fairly well. If I compare the computing time of the elementwise matrix operation

D_D=A_D*B_D-sin(C_D)+3.;

with that of a purpose-built CUDA kernel, I get the following results (matrix size in parentheses):

time [ms] hand-written kernel: 2.05 (1024x1024), 8.16 (2048x2048), 57.4 (4096x4096)

time [ms] LIBRARY: 2.07 (1024x1024), 8.17 (2048x2048), 57.4 (4096x4096)

The library seems to need approximately the same computing time as the hand-written kernel. I'm also using the C++11 keyword auto to evaluate expressions only when they are actually needed, following Expression templates: improving performance in evaluating expressions?. My first question is:

1. What further benefit (in terms of code optimization) would JIT provide to the library? Would JIT introduce any additional overhead due to runtime compilation?

It is known that a library based on Expression Templates cannot be put inside a .dll; see for example http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/00edbe1d-4906-4d91-b710-825b503787e2. My second question is:

2. Would JIT help in hiding the implementation from a third-party user? If so, how?

The CUDA SDK includes the ptxjit example, in which the PTX code is not loaded at runtime but defined at compile time. My third question is:

3. How should I implement JIT in my case? Are there examples of JIT using PTX loaded at runtime?
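As a concrete reference point for question 3, the following is a hedged sketch of loading PTX at runtime with the CUDA driver API; it is essentially what the ptxjit sample does, except the PTX string here could just as well be read from a file at runtime. The no-op kernel ewise_kernel is a placeholder for a real library kernel, and the example needs a CUDA-capable machine (link with -lcuda) to run:

```cuda
#include <cuda.h>
#include <cstdio>

// Minimal PTX for a no-op kernel, as produced e.g. by `nvcc -ptx`.
// A real library would ship (or generate) the PTX of its actual kernels.
static const char* ptx =
    ".version 6.0\n"
    ".target sm_30\n"
    ".address_size 64\n"
    ".visible .entry ewise_kernel()\n"
    "{\n"
    "    ret;\n"
    "}\n";

int main() {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    // JIT-compiles the PTX for the device of the current context.
    CUmodule mod;
    if (cuModuleLoadDataEx(&mod, ptx, 0, nullptr, nullptr) != CUDA_SUCCESS) {
        std::fprintf(stderr, "PTX JIT compilation failed\n");
        return 1;
    }

    // Look the kernel up by name and launch it.
    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "ewise_kernel");

    // 1x1x1 grid and block; a real kernel would pass parameters via the
    // kernelParams argument instead of nullptr.
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, 0, nullptr, nullptr);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

The same cuModuleLoadDataEx call accepts PTX read from disk or generated in memory, which is what makes runtime JIT of library-generated kernels possible.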

Thank you very much for any help.

EDIT following Talonmies' comment

The Cuda kernel just-in-time (jit) compilation possible? post states that

cuda code can be compiled to an intermediate format ptx code, which will then be jit-compiled to the actual device architecture machine code at runtime

A doubt I have is whether the above can be applied to an Expression Templates library. I know that, due to instantiation problems, CUDA/C++ template code cannot be compiled directly to PTX. But perhaps, if I instantiate all the possible combinations of types and operators for the unary and binary expressions, at least part of the implementation could be compiled to PTX (and thus hidden from third-party users), which could in turn be JIT-compiled for the architecture at hand.

  • I got lost when you suddenly jumped from C++ metaprogramming to just in time compilation of PTX assembly. There seems to be some fairly major missing elements between the two. How will C++ template code get JIT compiled to PTX? – talonmies Apr 08 '13 at 19:13
  • Thank you for your comment. I edited my post trying to better explain the problem. My problem is mainly to hide the implementation (or a part of the implementation) and to improve the performance. – Vitality Apr 08 '13 at 20:42

2 Answers


I think you should look into OpenCL. It provides a JIT-like programming model for creating, compiling, and executing compute kernels on GPUs (all at run-time).

I take a similar, expression-template-based approach in Boost.Compute, which allows the library to support C++ templates and generic algorithms by translating compile-time C++ expressions into OpenCL kernel code (which is a dialect of C).

Kyle Lutz
  • Thank you very much for your answer. But, could I "port" your answer to CUDA? In your experience, does JIT provide a further benefit or a further burdening due to runtime compilation? – Vitality Apr 15 '13 at 07:45

VexCL started as an expression-template library for OpenCL, but since v1.0 it also supports CUDA. What it does for CUDA is exactly JIT compilation of CUDA sources: the nvcc compiler is called behind the scenes, the compiled PTX is stored in an offline cache, and it is loaded on subsequent launches of the program. See the CUDA backend sources for how to do this; compiler.hpp should be of most interest to you.

ddemidov