58

I am developing a product with heavy 3D graphics computations, to a large extent closest-point and range searches. Some hardware optimization would be useful. While I know little about this, my boss (who has no software experience) advocates FPGA (because it can be tailored), while our junior developer advocates GPGPU with CUDA, because it's cheap, hot, and open. While I feel I lack judgement on this question, I believe CUDA is the way to go, also because I am worried about flexibility; our product is still under strong development.

So, rephrasing the question: are there any reasons to go for an FPGA at all? Or is there a third option?
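
For concreteness, the kind of computation I mean looks like the sketch below - a brute-force closest-point query written as a CUDA kernel purely for illustration (the names Point3 and findClosest are my own, and a real implementation would use a spatial data structure rather than a linear scan):

    #include <cfloat>

    struct Point3 { float x, y, z; };

    // One thread per query point; each thread scans the whole point set.
    __global__ void findClosest(const Point3* queries, int nQueries,
                                const Point3* points,  int nPoints,
                                int* closestIdx)
    {
        int q = blockIdx.x * blockDim.x + threadIdx.x;
        if (q >= nQueries) return;

        Point3 p = queries[q];
        float bestDist = FLT_MAX;
        int   best     = -1;
        for (int i = 0; i < nPoints; ++i) {
            float dx = points[i].x - p.x;
            float dy = points[i].y - p.y;
            float dz = points[i].z - p.z;
            float d  = dx*dx + dy*dy + dz*dz;   // squared distance suffices for comparison
            if (d < bestDist) { bestDist = d; best = i; }
        }
        closestIdx[q] = best;
    }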

Fredriku73

15 Answers

49

I investigated the same question a while back. After chatting to people who have worked on FPGAs, this is what I got:

  • FPGAs are great for real-time systems, where even 1 ms of delay might be too long. This does not apply in your case;
  • FPGAs can be very fast, especially for well-defined digital signal processing uses (e.g. radar data), but the good ones are much more expensive and specialised than even professional GPGPUs;
  • FPGAs are quite cumbersome to program. Since there is a hardware configuration component to compiling, it can take hours. They seem to be more suited to electronic engineers (who are generally the ones who work on FPGAs) than software developers.

If you can make CUDA work for you, it's probably the best option at the moment. It will certainly be more flexible than an FPGA.

Other options include Brook from ATI, but until something big happens, it is simply not as well adopted as CUDA. After that, there are still all the traditional HPC options (clusters of x86/PowerPC/Cell), but they are all quite expensive.

Hope that helps.

biozinc
  • 39
    "CUDA will be certainly more flexible than a FPGA" is false. For CUDA, you have to twist and turn your algorithm in very specific ways to enjoy the speed-up. With FPGAs you can do whatever you want - i.e. implement specialized computation routines tailored just for your algorithm. Granted, this requires HDL programming knolwedge, so CUDA is indeed more accessible for software programmers. – Eli Bendersky Apr 30 '09 at 05:51
  • 6
    FPGAs can now be programmed using OpenCL - https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html This should make FPGAs more attractive to software programmers. – ProfNimrod Jan 28 '16 at 14:56
  • 1
    Here is a great [article](http://mil-embedded.com/articles/fpga-gpu-evolution-continues/) discussing why the U.S. military is moving away from FPGA in favor of GPU. It discusses the floating point precision, latency, direct memory access and power consumption differences between the two. – Stan Mar 01 '17 at 15:46
  • 1
    This pro-FPGA [article](https://www.nextplatform.com/2016/03/17/fpgas-can-take-gpus-knights-landing/) points out that you "have to not look just at raw compute but how their models can scale across the various memory in a compute element and across multiple elements lashed together inside of a node and across nodes". – Stan Mar 01 '17 at 15:56
49

We did some comparisons between FPGA and CUDA. One thing where CUDA shines is if you can really formulate your problem in a SIMD fashion AND can access the memory coalesced. If the memory accesses are not coalesced (1), or if you have different control flow in different threads, the GPU can drastically lose performance and the FPGA can outperform it. Another case is when your operations are relatively small but you have a huge number of them, and you cannot (e.g. due to synchronisation) launch them in a loop in one kernel; then the invocation times for the GPU kernels exceed the computation time.

Also, the power consumption of the FPGA can be better (this depends on your application scenario; the GPU is only cheaper, in terms of Watts/FLOP, when it is computing all the time).

Of course the FPGA also has some drawbacks: I/O can be one (we had an application here where we needed 70 GB/s; no problem for a GPU, but getting this amount of data into an FPGA requires, for a conventional design, more pins than are available). Another drawback is time and money: an FPGA is much more expensive than the best GPU, and the development times are very high.

(1) Simultaneous accesses from different threads to memory have to be to sequential addresses. This is sometimes really hard to achieve.
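
To make the coalescing point concrete, here is a small sketch of my own (not from the original answer) contrasting a coalesced and a strided access pattern; on most GPUs the strided version runs many times slower because each warp touches many separate memory segments:

    // Neighbouring threads read neighbouring addresses: coalesced.
    __global__ void copyCoalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Neighbouring threads read addresses far apart: uncoalesced.
    __global__ void copyStrided(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[(i * stride) % n];
    }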

flolo
  • Nice answer. While the other answers confirmed what we had already researched, you provided some concrete examples of when either one may be better. Thanks. – Fredriku73 Dec 13 '08 at 16:10
  • Is there something wrong with the 70 GB/s value? The newest Fermi (2010) has 16 PCIe v2.0 lanes, and that is 8 GB/s. The on-card memory (GDDR5) can reach up to 54.4 GB/s. This is fast, but there are only a few GB available. – name Sep 02 '10 at 13:45
  • 3
    A high-end FPGA can definitely outperform GPGPU in terms of IO bandwidth. One PCIe Gen2 x16 interface delivers 16*500MByte*0.8 (8b/10b data encoding scheme) = 6.4GB/sec useful payload. Multiply that by at least 10 for high end Xilinx Virtex6 FPGAs with 48 transceivers – OutputLogic Feb 16 '11 at 05:00
  • 3
    My original post was from 2008; the Virtex-6 didn't exist then (hell, even in 2011 it's hard to get them in high volume) - but you are right: today 6.4 x 10 GB/s = 64 GB/s is possible. But GPGPU is also evolving, and its throughput is nowadays > 120 GB/s. – flolo Mar 29 '11 at 13:18
15

I would go with CUDA.
I work in image processing and have been trying hardware add-ons for years. First we had the i860, then the Transputer, then DSPs, then FPGAs and direct-compilation-to-hardware.
What inevitably happened was that by the time the hardware boards were really debugged and reliable and the code had been ported to them, regular CPUs had advanced to beat them, or the hosting machine architecture changed and we couldn't use the old boards, or the makers of the board went bust.

By sticking to something like CUDA you aren't tied to one small specialist maker of FPGA boards. The performance of GPUs is improving faster than that of CPUs and is funded by the gamers. It's a mainstream technology, so it will probably merge with multi-core CPUs in the future, which protects your investment.

Martin Beckett
  • CPUs aren't advancing that much anymore. However, we now have Xeon Phi (512-bit SIMD) which are similar. – Dmitri Nesteruk Mar 07 '14 at 13:29
  • I hear you @MartinBeckett about being FPGA-independent. But keep in mind that nvidia's UnifiedDeviceArch shines on nVIDIA's chips only ;-) So you still get the dependency. That's why OpenCL 2.0 with a SPIR based on LLVM's strong codebase seems like the way to go (February 2016) – Nikolaos Giotis Feb 02 '16 at 11:13
  • @NikYotis, this was written in 08 and I did say "something like CUDA". Today I would look at OpenCL for a general problem but CUDA probably has the edge if you need most performance now – Martin Beckett Feb 02 '16 at 13:54
  • 1
    sure @MartinBeckett I hope OpenCL 2.0 w/ the SPIR-V and AMD Radeon will put an end to nVIDIA's monopoly – Nikolaos Giotis Feb 03 '16 at 12:41
  • I wonder how long your development/debug cycle was for "regular CPUs had advanced to beat them." Right now FPGAs are 1-2 tech nodes behind the CPU, and there is no way an iterative program on a CPU can beat the performance of parallel and pipelined execution on an FPGA... – vijayvithal Nov 27 '17 at 15:40
10

FPGAs

  • What you need:
    • Learn VHDL/Verilog (and trust me you don't want to)
    • Buy hardware for testing, and licences for the synthesis tools
    • If you already have the infrastructure and only need to develop your core:
      • Develop the design (and it can take years)
    • If you don't:
      • DMA, hardware drivers, ultra-expensive synthesis tools
      • Tons of knowledge about buses, memory mapping, hardware synthesis
      • Build the hardware, buy the IP cores
      • Develop the design
      • Not to mention board development
  • For example, an average FPGA PCIe card with a Xilinx Zynq UltraScale+ chip costs more than $3000
  • The FPGA cloud is also costly, at $2/h+
  • Result:
    • This is something that requires the resources of a running company at the least.

GPGPU (CUDA/OpenCL)

  • You already have hardware to test on.
  • Compared to the FPGA stuff:
    • Everything is well documented.
    • Everything is cheap.
    • Everything works.
    • Everything is well integrated into programming languages.
  • There is a GPU cloud as well.
  • Result:
    • You just need to download the SDK and you can start (see the sketch below).
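
As an illustration of that last point, a complete CUDA program really is this small (a minimal sketch of my own; error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* v, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= a;                        // one thread per element
    }

    int main()
    {
        const int n = 1 << 20;
        float* d;
        cudaMalloc(&d, n * sizeof(float));           // allocate on the GPU
        cudaMemset(d, 0, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n); // launch ~1M threads
        cudaDeviceSynchronize();                     // wait for the kernel to finish
        cudaFree(d);
        printf("done\n");
        return 0;
    }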
Nic30g
  • 5
    Learning OpenCL is now sufficient to program FPGAs - https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html and given the primary importance of power combined with the lower power requirements of FPGAs for similar performance, FPGAs now offer an attractive alternative to GPU for many problem types. – ProfNimrod Jan 28 '16 at 15:01
  • 1
    I agree with @ProfNimrod nvidia's UnifiedDeviceArch shines on nVIDIA's chips only ;-) So you still get the dependency. That's why OpenCL 2.0 with a SPIR based on LLVM's strong codebase seems like the way to go (February 2016) – Nikolaos Giotis Feb 02 '16 at 11:15
4

Obviously this is a complex question. The question might also include the Cell processor. And there is probably not a single answer which is correct for all the related questions.

In my experience, any implementation done in an abstract fashion, i.e. a compiled high-level language vs. a machine-level implementation, will inevitably have a performance cost, especially in a complex algorithm implementation. This is true of both FPGAs and processors of any type. An FPGA designed specifically to implement a complex algorithm will perform better than an FPGA whose processing elements are generic, allowing it a degree of programmability from input control registers, data I/O, etc.

Another general example where an FPGA can deliver much higher performance is in cascaded processes, where one process's outputs become the inputs to another and they cannot be done concurrently. Cascading processes in an FPGA is simple and can dramatically lower memory I/O requirements, whereas on a processor, memory bandwidth is consumed to effectively cascade two or more processes that have data dependencies.
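
To see the memory-I/O point in GPU terms, here is a sketch of my own (not the answerer's code): cascading two stages on a GPU usually means a round trip through global memory, whereas an FPGA can stream one stage's output straight into the next:

    __global__ void stageA(const float* in, float* tmp, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = in[i] * in[i];     // stage A writes its result to memory
    }

    __global__ void stageB(const float* tmp, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tmp[i] + 1.0f;     // stage B reads it back
    }

    // Fusing the stages avoids the intermediate traffic, but this only works
    // when there are no cross-thread data dependencies between the stages.
    __global__ void fusedAB(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i] + 1.0f;
    }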

The same can be said of a GPU and a CPU. An algorithm implemented in C and executed on a CPU, developed without regard to the inherent performance characteristics of the cache or main memory system, will not perform as well as one that takes them into account. Granted, not considering these performance characteristics simplifies implementation, but at a performance cost.

I have no direct experience with GPUs, but knowing their inherent memory-system performance issues, I expect they are subject to the same kind of trade-offs.

4

This is an old thread started in 2008, but it would be good to recount what has happened to FPGA programming since then: 1. C-to-gates in FPGA is the mainstream development flow for many companies, with HUGE time savings vs. Verilog/SystemVerilog HDL. In C-to-gates, system-level design is the hard part. 2. OpenCL on FPGA has been around for 4+ years, including floating point and "cloud" deployment by Microsoft (Azure) and Amazon F1 (Ryft API). With OpenCL, system design is relatively easy because of the very well defined memory model and API between the host and the compute devices.

Software folks just need to learn a bit about FPGA architecture to be able to do things that are NOT EVEN POSSIBLE with GPUs and CPUs, for the reason that both are fixed silicon and have no broadband (100 Gb+) interfaces to the outside world. Scaling down chip geometry is no longer possible, nor is extracting more heat from a single chip package without melting it, so this looks like the end of the road for single-package chips. My thesis here is that the future belongs to parallel programming of multi-chip systems, and FPGAs have a great chance to be ahead of the game. Check out http://isfpga.org/ if you have concerns about performance, etc.

My Name
  • The topic is more suitable to be posted in [Electrical Engineering](https://electronics.stackexchange.com/) or [Data Science](https://datascience.stackexchange.com/). – thewaywewere May 05 '17 at 20:37
  • 1
    OpenCL on FPGAs removes these divisions. The FPGA logic fabric becomes compute resource. – My Name Jan 10 '19 at 16:46
3

An FPGA-based solution is likely to be way more expensive than CUDA.

OutputLogic
  • 2
    You need to quantify this. More expensive per watt? – Dmitri Nesteruk Mar 07 '14 at 13:29
  • This argument is no longer true; plus, you would have to wait a few years for the next-level GPU. With FPGAs you can scale OpenCL code now. Search for: scalable OpenCL FPGA - second hit. – My Name Mar 27 '18 at 15:40
3

What are you deploying on? Who is your customer? Without even knowing the answers to these questions, I would not use an FPGA unless you are building a real-time system and have electrical/computer engineers on your team with knowledge of hardware description languages such as VHDL and Verilog. There's a lot to it, and it takes a different frame of mind than conventional programming.

temp2290
3

I'm a CUDA developer with very little experience with FPGAs; however, I've been trying to find comparisons between the two.

What I've concluded so far:

  • The GPU has by far the higher (accessible) peak performance.
  • It has a more favorable FLOP/watt ratio.
  • It is cheaper.
  • It is developing faster (quite soon you will literally have a "real" TFLOP available).
  • It is easier to program (this comes from an article I read, not personal opinion).

Note that I'm saying real/accessible to distinguish from the numbers you will see in a GPGPU commercial.

BUT the GPU is not more favorable when you need to do random accesses to data. This will hopefully change with the new NVIDIA Fermi architecture, which has an optional L1/L2 cache.

my 2 cents

jim
3

Others have given good answers; I just wanted to add a different perspective. Here is my survey paper published in ACM Computing Surveys 2015 (its permalink is here), which compares GPUs with FPGAs and CPUs on the energy-efficiency metric. Most papers report that the FPGA is more energy efficient than the GPU, which, in turn, is more energy efficient than the CPU. Since power budgets are fixed (depending on cooling capability), the energy efficiency of an FPGA means one can do more computations within the same power budget, and thus get better performance with an FPGA than with a GPU. Of course, also account for the FPGA limitations mentioned by others.

user984260
  • 1
    Power is only one aspect where FPGAs win. FPGAs have broadband interfaces allowing them to take in data directly, bypassing the system memory of a server. On FPGAs we can forget John von Neumann altogether and do a much better job than GPUs/CPUs. – My Name May 24 '18 at 20:24
3

CUDA has a fairly substantial code base of examples and an SDK, including a BLAS back-end. Try to find some examples similar to what you are doing, perhaps also looking at the GPU Gems series of books, to gauge how well CUDA will fit your application. I'd say from a logistical point of view, CUDA is easier to work with and much, much cheaper than any professional FPGA development toolkit.
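
For instance, the BLAS back-end (cuBLAS) needs very little host code. A minimal sketch using the modern cuBLAS v2 API (error checking omitted; the gemmOnGpu wrapper name is mine, and the device buffers are assumed to be already allocated and filled):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Multiply two n-by-n matrices on the GPU: C = alpha*A*B + beta*C.
    void gemmOnGpu(const float* dA, const float* dB, float* dC, int n)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 1.0f, beta = 0.0f;
        // cuBLAS uses column-major storage, like the Fortran BLAS it mirrors.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

        cublasDestroy(handle);
    }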

At one point I did look into CUDA for claim reserve simulation modelling. There is quite a good series of lectures linked off the website for learning. On Windows, you need to make sure CUDA is running on a card with no displays attached, as the graphics subsystem has a watchdog timer that will nuke any process running for more than 5 seconds. This does not occur on Linux.

Any machine with two PCIe x16 slots should support this. I used an HP XW9300, which you can pick up off eBay quite cheaply. If you do, make sure it has two CPUs (not one dual-core CPU), as the PCIe slots live on separate HyperTransport buses and you need two CPUs in the machine to have both buses active.

ConcernedOfTunbridgeWells
3
  • FPGAs are more parallel than GPUs, by three orders of magnitude. While a good GPU features thousands of cores, an FPGA may have millions of programmable gates.
  • While CUDA cores must do highly similar computations to be productive, FPGA cells are truly independent of each other.
  • FPGAs can be very fast with some groups of tasks and are often used where a millisecond is already seen as a long duration.
  • A GPU core is way more powerful than an FPGA cell, and much easier to program. It is a full core that can divide and multiply with no problem, whereas an FPGA cell is only capable of rather simple Boolean logic.
  • As a GPU core is a core, it is efficient to program it in C++. Even if it is also possible to program an FPGA in C++, it is inefficient (just "productive"). Specialized languages like VHDL or Verilog must be used - they are difficult and challenging to master.
  • Most of the tried and true instincts of a software engineer are useless with an FPGA. You want a for loop with these gates? Which galaxy are you from? You need to shift into the mindset of an electronics engineer to understand this world.
Audrius Meškauskas
  • 1
    I could not agree with this comment at all. It started well, but ended with rather not quite right information. Let us keep to the facts: 1. Both major FPGA vendors have a C-driven flow for FPGAs. 2. Granted the hiccups circa 2013-2015, OpenCL is stable and mature from both vendors. No need to resort to Verilog or VHDL. 3. Yes, for loops work just great in C/OpenCL for FPGAs, and some run even faster than on GPUs. 4. The mindset of software engineers has to change - true. Cluster computing is the future: TPU, DGX-1, Azure, AWS... 5. FPGAs are NOT FIXED SILICON - ANYTHING IS POSSIBLE in FPGAs now. – My Name Mar 27 '18 at 14:51
2

FPGAs will not be favoured by those with a software bias, as they need to learn an HDL or at least understand SystemC.

For those with a hardware bias FPGA will be the first option considered.

In reality a firm grasp of both is required & then an objective decision can be made.

OpenCL is designed to run on both FPGA & GPU; even CUDA can be ported to FPGA.

FPGA & GPU accelerators can be used together

So it's not a case of what is better one or the other. There is also the debate about CUDA vs OpenCL

Again, unless you have optimized & benchmarked both for your specific application, you cannot know with 100% certainty.

Many will simply go with CUDA because of its commercial nature & resources. Others will go with OpenCL because of its versatility.

Gareth Thomas
1

At the latest GTC'13, many HPC people agreed that CUDA is here to stay. FPGAs are cumbersome; CUDA is getting much more mature, supporting Python/C/C++/ARM. Either way, that was a dated question.

Nikolaos Giotis
  • 1
    Programming FPGAs is now much easier as OpenCL is supported - https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html – ProfNimrod Jan 28 '16 at 14:59
  • 1
    I 100% second this. C'mon, pick up a book on OpenCL and read. When you go to one company's gathering you will be fed very biased information. – My Name May 24 '18 at 20:26
-2

Programming a GPU in CUDA is definitely easier. If you don't have any experience with programming FPGAs in an HDL, it will almost surely be too much of a challenge for you. You can still program them with OpenCL, which is kinda similar to CUDA, but it is harder to implement and probably a lot more expensive than programming GPUs.

Which one is Faster?

The GPU runs faster, but the FPGA can be more efficient.

The GPU has the potential of running at a speed higher than the FPGA can ever reach, but only for algorithms that are specially suited for it. If the algorithm is not optimal, the GPU will lose a lot of performance.

The FPGA, on the other hand, runs much slower, but you can implement problem-specific hardware that will be very efficient and get stuff done in less time.

It's kinda like eating your soup with a fork very fast vs. eating it with a spoon more slowly.

Both devices base their performance on parallelization, but each in a slightly different way. If the algorithm can be granulated into a lot of pieces that execute the same operations (keyword: SIMD), the GPU will be faster. If the algorithm can be implemented as a long pipeline, the FPGA will be faster. Also, if you want to use floating point, the FPGA will not be very happy with it :)

I have dedicated my whole master's thesis to this topic: Algorithm Acceleration on FPGA with OpenCL.

Zdovc