Questions tagged [hpc]

High Performance Computing (HPC) refers to the use of supercomputers and computer clusters to solve a wide range of computationally intensive problems.

Systems with a benchmark performance of hundreds of teraflops or more are usually considered supercomputers. A typical feature of these machines is a large number of compute nodes, typically in the range of O(10^3) to O(10^6). This distinguishes them from small-to-midsize computing clusters, which usually have O(10) to O(10^2) nodes.

When writing software that aims to make effective use of these resources, a number of challenges arise that are usually not present when working on single-core systems or even small clusters:


Higher degree of parallelization required

According to the original 1967 formulation of Amdahl's Law, the maximum speedup one can achieve on a parallel computer is limited by the serial fraction of your code (i.e. the parts that cannot be parallelized). That means the more processors you have, the better your parallelization concept has to be. Contemporary re-formulations of the law additionally charge for the add-on costs of parallel execution: process-spawning overhead, serialization/deserialization (SER/DES) of parameters and results, inter-process communication, and the atomicity (minimum granularity) of the work items. Once these add-on costs are accounted for, the overhead-strict, revised Amdahl's Law reflects the actual net speedup of parallel code execution much more closely.
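A minimal sketch of the difference between the two formulations (the serial fraction s, the process count n, and the linear overhead coefficient o are illustrative assumptions, not measured values):

    # Classical vs. overhead-aware Amdahl's Law -- a hedged sketch.
    # s: serial fraction of the code, n: number of processes,
    # o: add-on cost (spawning, SER/DES, communication) per process,
    #    expressed as a fraction of the single-process runtime.

    def amdahl_speedup(s: float, n: int) -> float:
        """Classical bound: serial part plus perfectly parallel part."""
        return 1.0 / (s + (1.0 - s) / n)

    def overhead_aware_speedup(s: float, n: int, o: float) -> float:
        """Revised bound: overheads grow with the number of processes."""
        return 1.0 / (s + (1.0 - s) / n + o * n)

    for n in (10, 100, 1_000, 10_000):
        print(n, round(amdahl_speedup(0.05, n), 1),
              round(overhead_aware_speedup(0.05, n, 1e-5), 1))

With a 5% serial fraction the classical bound saturates near 20x, while the overhead-aware estimate peaks and then degrades as n grows, which is exactly the net-speedup effect described above.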


Specialized hardware and software

Most supercomputers are custom-built and use specialized hardware and/or software components, which means you have to learn a lot about new types of architectures if you want to get maximum performance. Typical examples are the network hardware, the file system, and the available compilers (including their optimization options).


Parallel file I/O becomes a serious bottleneck

Good parallel file systems handle multiple requests in parallel rather well. However, there is a limit, and most file systems do not support simultaneous access by thousands of processes. Reading from or writing to a single file then becomes internally serialized again, even if you are using parallel I/O concepts such as MPI I/O.
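For illustration, a minimal mpi4py sketch of a collective write to a single shared file (the file name and block size are arbitrary assumptions); even with collective MPI I/O calls like this, the underlying file system may internally serialize the requests once thousands of ranks participate:

    # Each rank writes its own disjoint block of one shared file
    # using a collective MPI-IO call. File name and block size are
    # arbitrary assumptions for the sketch.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    block = np.full(1024, rank, dtype=np.int32)   # this rank's data
    fh = MPI.File.Open(comm, "shared.dat",
                       MPI.MODE_CREATE | MPI.MODE_WRONLY)
    fh.Write_at_all(rank * block.nbytes, block)   # disjoint offsets
    fh.Close()

Run with, e.g., mpiexec -n 4 python shared_write.py; the point of the collective Write_at_all is to give the MPI library a chance to aggregate requests before they hit the file system.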


Debugging massively parallel applications is a pain

If you have a problem in your code that only appears when you run it with a certain number of processes, debugging can become very cumbersome, especially if you are not sure where exactly the problem arises. Typical sources of process-count-dependent problems are the domain decomposition and the establishment of communication patterns, as in the toy example below.
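A hedged toy example (the grid size is chosen arbitrarily): a naive 1-D domain decomposition whose bug only shows up for certain values of -np, namely whenever the number of processes does not divide the grid size:

    # Naive 1-D domain decomposition that silently drops the remainder
    # cells -- the bug is invisible for e.g. -np 4, but real for -np 3.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    ncells = 1000
    local = ncells // size          # BUG: remainder cells are dropped
    total = comm.allreduce(local, op=MPI.SUM)

    if rank == 0 and total != ncells:
        print(f"lost {ncells - total} cells with {size} processes")

With 4 processes the cell counts sum to exactly 1000 and everything looks fine; with 3 processes, one cell is silently lost, which is exactly the kind of discrepancy that is painful to track down at scale.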


Load balancing and communication patterns matter (even more)

This is similar to the first point. Assume that one of your compute nodes takes a little bit longer (e.g. one millisecond) than the others to reach a point where all processes have to be synchronized. With 101 nodes, the other 100 each wait 1 ms, so you only waste 100 * 1 ms = 0.1 s of aggregate compute time. With 100,001 nodes, however, you already waste 100 s per synchronization. If this happens repeatedly (e.g. in every iteration of a big loop) and you have a lot of iterations, using more processors soon becomes uneconomical.
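The same arithmetic as a minimal sketch (the 1 ms delay and the node counts are the illustrative numbers from the text):

    # Aggregate compute time wasted when all other nodes wait for one
    # straggler at a synchronization point.
    delay = 1e-3                      # the straggler is 1 ms late
    for nodes in (101, 100_001):
        wasted = (nodes - 1) * delay  # CPU seconds lost, summed over the waiting nodes
        print(f"{nodes:>7} nodes: {wasted:6.1f} s wasted per synchronization")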


Last but not least: power and thermal constraints

Thermal ceilings and power-capping strategies add yet another dimension to performance tuning, and end-to-end performance is what counts. Thermal and power constraints introduce an extra set of parameters that determine how efficiently HPC workloads can be executed on a time-constrained and power-capped computing infrastructure. Because the trade-offs differ from machine to machine and from workload to workload, the optimum thermal and power-capping configuration for distributing a workload over the infrastructure is rarely intuitive. Recurring workloads (weather modelling is a typical example) therefore adapt these settings from run to run as experience is gathered, since sufficiently extensive prior testing is usually not feasible.

1502 questions
160
votes
5 answers

MPICH vs OpenMPI

Can someone elaborate on the differences between the OpenMPI and MPICH implementations of MPI? Which of the two is the better implementation?
lava
  • 1,945
  • 2
  • 14
  • 15
43
votes
1 answer

HPC cluster: select the number of CPUs and threads in SLURM sbatch

The terminology used in the sbatch man page might be a bit confusing. Thus, I want to be sure I am getting the options set right. Suppose I have a task to run on a single node with N threads. Am I correct to assume that I would use --nodes=1 and…
Tanash
  • 461
  • 1
  • 5
  • 8
38
votes
1 answer

Use slurm job id

When I launch a computation on the cluster, I usually have a separate program doing the post-processing at the end: sbatch simulation sbatch --dependency=afterok:JOBIDHERE postprocessing I want to avoid mistyping and automatically have the good…
user1824346
  • 575
  • 1
  • 6
  • 17
37
votes
1 answer

Slurm: Why use srun inside sbatch?

In an sbatch script, you can directly launch programs or scripts (for example an executable file myapp), but in many tutorials people use srun myapp instead. Despite reading some documentation on the topic, I do not understand the difference and when…
RomualdM
  • 853
  • 8
  • 11
34
votes
2 answers

mpirun - not enough slots available

Usually when I use mpirun, I can "overload" it, using more processors than there actually are on my computer. For example, on my four-core mac, I can run mpirun -np 29 python -c "print 'hey'" no problem. I'm on another machine now, which is…
kilojoules
  • 9,768
  • 18
  • 77
  • 149
34
votes
2 answers

Having windows Azure A8 nodes with InfiniBand support how to send N bytes from one and receive on another?

I like InfiniBand's promise of a 40 Gbit/s network. My needs do not map onto the MPI model with one core node + slaves, and if possible I would prefer not to use MPI at all. I need a simple connect/send/receive/close (or its async versions) API. Yet…
DuckQueen
  • 772
  • 10
  • 62
  • 134
32
votes
1 answer

comment in bash script processed by slurm

I am using slurm on a cluster to run jobs and submit a script that looks like below with sbatch: #!/usr/bin/env bash #SBATCH -o slurm.sh.out #SBATCH -p defq #SBATCH --mail-type=ALL #SBATCH --mail-user=my.email@something.com echo "hello" Can I…
user1981275
  • 13,002
  • 8
  • 72
  • 101
27
votes
4 answers

ANT Problems: net/sf/antcontrib/antcontrib.properties

I am attempting to install software onto my Debian Lenny server. Specifically, Capture-HPC. I have set up VMWare server, along with all the prerequisites. When I go to run ant in the directory, I get the following error: [taskdef] Could not load…
Julio
  • 2,261
  • 4
  • 30
  • 56
23
votes
0 answers

Is it possible to high performance computing by Golang and CUDA?

I've googled for a while and the only useful info I found is: github.com/barnex/cuda5 mumax.github.io/ Unfortunately, the latest Arch Linux only provides the CUDA 7.5 package, so barnex's project may not be supported. Arne Vansteenkiste recommends…
Yang
  • 759
  • 2
  • 9
  • 30
22
votes
3 answers

Intel MKL vs. AMD Math Core Library

Does anybody have experience programming for both the Intel Math Kernel Library and the AMD Math Core Library? I'm building a personal computer for high performance statistical computations and am debating on the components to buy. An appeal of…
Tristan
  • 6,776
  • 5
  • 40
  • 63
21
votes
3 answers

How to find from where a job is submitted in SLURM?

I submitted several jobs via SLURM to our school's HPC cluster. Because the shell scripts all have the same name, the job names appear exactly the same. It looks like [myUserName@rclogin06 ~]$ sacct -u myUserName JobID JobName …
Sibbs Gambling
  • 19,274
  • 42
  • 103
  • 174
19
votes
10 answers

Have you successfully used a GPGPU?

I am interested to know whether anyone has written an application that takes advantage of a GPGPU by using, for example, nVidia CUDA. If so, what issues did you find and what performance gains did you achieve compared with a standard CPU?
John Channing
  • 6,501
  • 7
  • 45
  • 56
16
votes
2 answers

GCC SSE code optimization

This post is closely related to another one I posted some days ago. This time, I wrote a simple code that just adds a pair of arrays of elements, multiplies the result by the values in another array and stores it in a fourth array, all variables…
Genís
  • 1,468
  • 2
  • 13
  • 24
16
votes
3 answers

Enumerating combinations in a distributed manner

I have a problem where I must analyse 500C5 combinations (255244687600) of something. Distributing it over a 10-node cluster where each node processes roughly 10^6 combinations per second means the job will be complete in about seven hours. The…
Matthieu N.
16
votes
1 answer

Why can't my ultraportable laptop CPU maintain peak performance in HPC

I have developed a high performance Cholesky factorization routine, which should have peak performance at around 10.5 GFLOPs on a single CPU (without hyperthreading). But there is some phenomenon which I don't understand when I test its performance.…
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248