Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with a benchmark performance of hundreds of teraflops or more are usually considered supercomputers. A typical feature of these machines is a large number of compute nodes, typically in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to the original 1967 formulation of Amdahl's Law, the maximum speedup one can achieve on a parallel computer is limited by the serial fraction of your code (i.e. the parts that cannot be parallelized). That means the more processors you have, the better your parallelization concept has to be. Contemporary re-formulations of the law additionally charge for the add-on costs of parallel execution: process-spawning overhead, serialization/deserialization (SER/DES) of parameters and results, inter-process communication, and the atomicity (minimum granularity) of the work items. Once these add-on costs are accounted for, the overhead-strict, revised Amdahl's Law reflects the actual net speedup of parallel code execution much more closely.
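A minimal sketch of the difference between the two formulations (the serial fraction s, the process count n, and the linear overhead coefficient o are illustrative assumptions, not measured values):

    # Classical vs. overhead-aware Amdahl's Law -- a hedged sketch.
    # s: serial fraction of the code, n: number of processes,
    # o: add-on cost (spawning, SER/DES, communication) per process,
    #    expressed as a fraction of the single-process runtime.

    def amdahl_speedup(s: float, n: int) -> float:
        """Classical bound: serial part plus perfectly parallel part."""
        return 1.0 / (s + (1.0 - s) / n)

    def overhead_aware_speedup(s: float, n: int, o: float) -> float:
        """Revised bound: overheads grow with the number of processes."""
        return 1.0 / (s + (1.0 - s) / n + o * n)

    for n in (10, 100, 1_000, 10_000):
        print(n, round(amdahl_speedup(0.05, n), 1),
              round(overhead_aware_speedup(0.05, n, 1e-5), 1))

With a 5% serial fraction the classical bound saturates near 20x, while the overhead-aware estimate peaks and then degrades as n grows, which is exactly the net-speedup effect described above.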


Specialized hardware and software

Most supercomputers are custom-built and use specialized hardware and/or software components, which means you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, and the available compilers (including their optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit, and most file systems do not support simultaneous access by thousands of processes. Reading from or writing to a single file then becomes internally serialized again, even if you are using parallel I/O concepts such as MPI I/O.
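For illustration, a minimal mpi4py sketch of a collective write to a single shared file (the file name and block size are arbitrary assumptions); even with collective MPI I/O calls like this, the underlying file system may internally serialize the requests once thousands of ranks participate:

    # Each rank writes its own disjoint block of one shared file
    # using a collective MPI-IO call. File name and block size are
    # arbitrary assumptions for the sketch.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    block = np.full(1024, rank, dtype=np.int32)   # this rank's data
    fh = MPI.File.Open(comm, "shared.dat",
                       MPI.MODE_CREATE | MPI.MODE_WRONLY)
    fh.Write_at_all(rank * block.nbytes, block)   # disjoint offsets
    fh.Close()

Run with, e.g., mpiexec -n 4 python shared_write.py; the point of the collective Write_at_all is to give the MPI library a chance to aggregate requests before they hit the file system.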


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Typical sources of process-count-dependent problems are the domain decomposition and the establishment of communication patterns, as in the toy example below.
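A hedged toy example (the grid size is chosen arbitrarily): a naive 1-D domain decomposition whose bug only shows up for certain values of -np, namely whenever the number of processes does not divide the grid size:

    # Naive 1-D domain decomposition that silently drops the remainder
    # cells -- the bug is invisible for e.g. -np 4, but real for -np 3.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    ncells = 1000
    local = ncells // size          # BUG: remainder cells are dropped
    total = comm.allreduce(local, op=MPI.SUM)

    if rank == 0 and total != ncells:
        print(f"lost {ncells - total} cells with {size} processes")

With 4 processes the cell counts sum to exactly 1000 and everything looks fine; with 3 processes, one cell is silently lost, which is exactly the kind of discrepancy that is painful to track down at scale.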


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your compute nodes takes a little bit longer (e.g. one millisecond) than the others to reach a point where all processes have to be synchronized. With 101 nodes, the other 100 each wait 1 ms, so you only waste 100 * 1 ms = 0.1 s of aggregate compute time. With 100,001 nodes, however, you already waste 100 s per synchronization. If this happens repeatedly (e.g. in every iteration of a big loop) and you have a lot of iterations, using more processors soon becomes uneconomical.
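The same arithmetic as a minimal sketch (the 1 ms delay and the node counts are the illustrative numbers from the text):

    # Aggregate compute time wasted when all other nodes wait for one
    # straggler at a synchronization point.
    delay = 1e-3                      # the straggler is 1 ms late
    for nodes in (101, 100_001):
        wasted = (nodes - 1) * delay  # CPU seconds lost, summed over the waiting nodes
        print(f"{nodes:>7} nodes: {wasted:6.1f} s wasted per synchronization")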


Last but not least: power and thermal constraints

Thermal ceilings and power-capping strategies add yet another dimension to performance tuning, and end-to-end performance is what counts. Thermal and power constraints introduce an extra set of parameters that determine how efficiently HPC workloads can be executed on a time-constrained and power-capped computing infrastructure. Because the trade-offs differ from machine to machine and from workload to workload, the optimum thermal and power-capping configuration for distributing a workload over the infrastructure is rarely intuitive. Recurring workloads (weather modelling is a typical example) therefore adapt these settings from run to run as experience is gathered, since sufficiently extensive prior testing is usually not feasible.

1502 questions
160
votes
5 answers

MPICH vs OpenMPI

Can someone elaborate on the differences between the OpenMPI and MPICH implementations of MPI? Which of the two is the better implementation?
lava
  • 1,945
  • 2
  • 14
  • 15
43
votes
1 answer

HPC cluster: select the number of CPUs and threads in SLURM sbatch

The terminology used in the sbatch man page might be a bit confusing. Thus, I want to be sure I am getting the options set right. Suppose I have a task to run on a single node with N threads. Am I correct to assume that I would use --nodes=1 and…
Tanash
  • 461
  • 1
  • 5
  • 8
38
votes
1 answer

Use slurm job id

When I launch a computation on the cluster, I usually have a separate program doing the post-processing at the end: sbatch simulation sbatch --dependency=afterok:JOBIDHERE postprocessing I want to avoid mistyping and automatically have the good…
user1824346
  • 575
  • 1
  • 6
  • 17
37
votes
1 answer

Slurm: Why use srun inside sbatch?

In an sbatch script, you can directly launch programs or scripts (for example an executable file myapp), but in many tutorials people use srun myapp instead. Despite reading some documentation on the topic, I do not understand the difference and when…
RomualdM
  • 853
  • 8
  • 11
34
votes
2 answers

mpirun - not enough slots available

Usually when I use mpirun, I can "overload" it, using more processors than there actually are on my computer. For example, on my four-core mac, I can run mpirun -np 29 python -c "print 'hey'" no problem. I'm on another machine now, which is…
kilojoules
  • 9,768
  • 18
  • 77
  • 149
34
votes
2 answers

Having windows Azure A8 nodes with InfiniBand support how to send N bytes from one and receive on another?

I like InfiniBand's promise of a 40 Gbit/s network. My needs do not map onto the MPI model with one core node + slaves, and if possible I would prefer not to use MPI at all. I need a simple connect/send/receive/close (or its async versions) API. Yet…
DuckQueen
  • 772
  • 10
  • 62
  • 134
32
votes
1 answer

comment in bash script processed by slurm

I am using slurm on a cluster to run jobs and submit a script that looks like below with sbatch: #!/usr/bin/env bash #SBATCH -o slurm.sh.out #SBATCH -p defq #SBATCH --mail-type=ALL #SBATCH --mail-user=my.email@something.com echo "hello" Can I…
user1981275
  • 13,002
  • 8
  • 72
  • 101
27
votes
4 answers

ANT Problems: net/sf/antcontrib/antcontrib.properties

I am attempting to install software onto my Debian Lenny server. Specifically, Capture-HPC. I have set up VMWare server, along with all the prerequisites. When I go to run ant in the directory, I get the following error: [taskdef] Could not load…
Julio
  • 2,261
  • 4
  • 30
  • 56
23
votes
0 answers

Is it possible to high performance computing by Golang and CUDA?

I've googled for a while and the only useful info I found is: github.com/barnex/cuda5 mumax.github.io/ Unfortunately, the latest Arch Linux only provides the CUDA 7.5 package, so barnex's project may not be supported. Arne Vansteenkiste recommends…
Yang
  • 759
  • 2
  • 9
  • 30
22
votes
3 answers

Intel MKL vs. AMD Math Core Library

Does anybody have experience programming for both the Intel Math Kernel Library and the AMD Math Core Library? I'm building a personal computer for high performance statistical computations and am debating on the components to buy. An appeal of…
Tristan
  • 6,776
  • 5
  • 40
  • 63
21
votes
3 answers

How to find from where a job is submitted in SLURM?

I submitted several jobs via SLURM to our school's HPC cluster. Because the shell scripts all have the same name, the job names appear exactly the same. It looks like [myUserName@rclogin06 ~]$ sacct -u myUserName JobID JobName …
Sibbs Gambling
  • 19,274
  • 42
  • 103
  • 174
19
votes
10 answers

Have you successfully used a GPGPU?

I am interested to know whether anyone has written an application that takes advantage of a GPGPU by using, for example, nVidia CUDA. If so, what issues did you find and what performance gains did you achieve compared with a standard CPU?
John Channing
  • 6,501
  • 7
  • 45
  • 56
16
votes
2 answers

GCC SSE code optimization

This post is closely related to another one I posted some days ago. This time, I wrote a simple code that just adds a pair of arrays of elements, multiplies the result by the values in another array and stores it in a fourth array, all variables…
Genís
  • 1,468
  • 2
  • 13
  • 24
16
votes
3 answers

Enumerating combinations in a distributed manner

I have a problem where I must analyse 500C5 combinations (255244687600) of something. Distributing it over a 10-node cluster where each node processes roughly 10^6 combinations per second means the job will be complete in about seven hours. The…
Matthieu N.
16
votes
1 answer

Why can't my ultraportable laptop CPU maintain peak performance in HPC

I have developed a high performance Cholesky factorization routine, which should have peak performance at around 10.5 GFLOPs on a single CPU (without hyperthreading). But there is some phenomenon which I don't understand when I test its performance.…
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248