OpenMP

History

OpenMP is an application programming interface (API) for C, C++ and Fortran, intended for multi-threaded, shared-memory programming. It is not meant to be used on distributed-memory parallel systems. The first OpenMP specification was published in 1997.

Release History

  • October 1997: Fortran version 1.0
  • October 1998: C/C++ version 1.0
  • November 2000: Fortran version 2.0
  • March 2002: C/C++ version 2.0
  • May 2005: Combined C/C++ and Fortran version 2.5

OpenMP Structure

OpenMP is based on multiple threads working on shared memory. It uses the fork-join model for parallel execution. An OpenMP program is structured as follows:

  • the program is itself a master thread at the beginning
  • until reaching a parallel region, the master thread works sequentially
  • for the parallel region, the master thread creates some new threads
  • these new threads work in parallel
  • when the parallel work is completed, only the master thread continues to work

OpenMP Directives

Directives are based on #pragma directives in the C and C++ standards.

Directive Format

Each directive starts with #pragma omp to reduce the potential for conflict with other #pragma directives. Directives are case-sensitive. The order in which clauses appear in a directive is not significant. The syntax of an OpenMP directive is:

#pragma omp directive-name [clause[ [,] clause]...] new-line

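Only one directive name may be specified per directive, apart from the combined constructs such as parallel for. For example, the following is an error:

#pragma omp parallel barrier /* ERROR */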

Directives/Constructs

parallel Construct

Syntax:

#pragma omp parallel [clause[ [, ]clause] ...] new-line

for Construct

Syntax:

#pragma omp for [clause[[,] clause] ...] new-line
for-loop

The iterations of the for loop are distributed across the threads that already exist in the team executing the parallel construct to which it binds. The for loop must have canonical shape so that the number of iterations can be computed on entry to the loop. Example (using the combined parallel for construct, which creates the team and distributes the loop in one directive):

#pragma omp parallel for
for(i = 0; i < 100; ++i) { }

Features

  • OpenMP makes it easy to reuse existing code. The user can parallelize parts of a program with small modifications using parallel directives. Because the parallelization is expressed through compiler directives, a compiler without OpenMP support can still compile a sequential version: the directives are skipped and the resulting program runs sequentially.
  • The user can protect critical regions with lock mechanisms. Furthermore, OpenMP provides reduction clauses for combining per-thread results, e.g. reduction(+: sum) (written REDUCTION(+: sum) in Fortran).

Disadvantages

  • It is difficult to parallelize loops with an unknown number of iterations (this may be addressed by a workqueue extension in the future).
  • Recursive programming leads to overhead caused by an uncontrolled number of nested parallel procedure calls.
  • Nested parallel regions are not supported by some compilers.
  • Some C control structures cannot be used in parallel regions with some compilers (e.g. the AIX C compiler (xlc) does not support break in parallel loops).
  • OpenMP simplifies parallel programming, but lacks some features of other parallel programming systems (e.g. something similar to condition variables).

Experiences

Speedups

We tested our implementations of the Cowichan problems on two different machines. One is a dual AMD Opteron 248 machine with 4 GB of memory running Red Hat Enterprise Linux WS (64-bit). The other is an 8-processor machine running AIX. We used the Portland Group compiler 5.2-4 on the first and the AIX xlc compiler on the second. The charts below show the speedups we obtained. Note: two problems (norm, vecdiff) are missing from the first chart because their execution times are too small to measure a speedup. To measure speedups, we executed all problems in a chain with the same matrix or vector sizes. Some problems cannot handle large input sizes, but norm and vecdiff need exactly such sizes to run long enough to be measured. As a result, these two problems are not comparable with the execution times of the other programs.

The poor speedups for norm and vecdiff on the 8-processor machine are perhaps caused by the small size of the input data. Another problem with strange speedup behavior is gauss. As the charts show, gauss has no speedup on the dual-processor machine, but on the 8-processor machine it has a significant speedup with 2 threads. On the other hand, winnow shows a speedup on the dual-processor machine but not on the 8-processor machine. This may be caused by the different hardware architectures.

We measured our speedups several times, but we suspect that many other processes running on the dual-processor machine slowed down our programs.


Speedups on a dual AMD Opteron 248

Image:openmp_chart_amd.png

Speedups on an 8-processor machine

Image:openmp_chart.png

Speedups depending on input size

Image:openmp_size_speedup.png

The input size is multiplied by the factor shown on the X-axis. Factor 1 is the original input from Wilson (128x128 matrices, etc.). As expected, the speedups depend on input size, because tests with small inputs suffer more from the overhead introduced by the OpenMP library.

OpenMP Tools

DEEP is a performance analyser for OpenMP.
OpenMP Thread Checker by Intel.
OpenMP Thread Profiler by Intel.
Paraver: Visualization environment for MPI, OpenMP, Mixed, MLP, Java.
Etnus TotalView: A debugger for Linux and Unix applications. It supports MPI and OpenMP.

Related Links