Message Passing Interface

From Parawiki

(Redirected from MPI)

Contents

History

MPI is a standard portable message-passing library developed in 1993 by a group of parallel computer vendors, software writers, and application scientists. MPI is available to both Fortran and C programs. It is a standard for communication among processors working in parallel on a distributed memory system. It is not a system, but a standard. The first version MPI-1 appeared in 1994, the enhanced standard MPI-2 was released in 1997.

Very common portable implementations of MPI are MPICH, MPICH2 and LAM MPI. The following link tries to give a full listing of all available MPI implementations (provided by the LAM MPI team) Listing of MPI Implementations

MPI Communication

MPI communication is based on process communication . Processes working in parallel but on different data exchange information. MPI can be used for SPMD and MPMD programming styles. It cannot be used only on one computer but across a whole lot of networked computers. Exchanging data between processes requires process cooperation. It is based on send and receive routines.

Peer to Peer Communication

Blocking Communication

A typical blocking send looks like

  if (threadID == 0) {
     MPI_Recv(data, dataLength, MPI_INT, 1, 69, MPI_COMM_WORLD, &status);
  } else {
     MPI_Send(data, dataLength, MPI_INT, 0, 69, MPI_COMM_WORLD);
  }
  // continue with the programm

In this Example data from process 1 is send via blocking send/receive to process 0. The program waits until the data has been send and received. After that it continues.

An alternative to above is to use MPI_Sendrecv:

  MPI_Sendrecv(data, dataLength, MPI_INT, 0, 69, data, dataLength, MPI_INT, 1, 69, MPI_COMM_WORLD, &status);

NonBlocking Communication

A nonblocking post-send initiates a send operationm but does not complte ist. The post-send may return before thje message is copied out to the send buffer. Similarly, a nonblocking post-receive initiates a receive operarion, but does not complete it. The call may return before a message is stored into the receive buffer. A sperate complete-receive is needed to verify that the receive operation has completed.

Collective Communication

The Example below shows how a matrix could be broadcast to all other processes:

  int *matrixGlobal=new int[numRows*numCols];
  MPI_Bcast(matrixGlobal, numRows*numCols, MPI_INT, 0, MPI_COMM_WORLD);

Often it is not desirable to send the whole matrix to all processes just for the case they only need a subset of the original matrix. Example for scatter data:

  int *matrixGlobal = new int[numRows*numCols];
  int *matrixLocal = new int[localNumRows*numCols];
  int *elementCounts = ... // number of elements on each thread
  int *elementDispls = ... // global offset of first element on each process
  MPI_Scatterv(matrixGlobal, elementCounts, elementDispls, MPI_INT, matrixLocal, elementCounts[rankID], MPI_INT, 0, MPI_COMM_WORLD);

Allgatherv is needed if an across processes distributed matrix may be gathered so that each process has an actual copy of the entire matrix

  int *matrixGlobal = new int[numRows*numCols];
  int *matrixLocal = new int[localNumRows*numCols];
  int *elementCounts = ... // number of elements on each thread
  int *elementDispls = ... // global offset of first element on each process
  MPI_Allgatherv(matrixLocal, localNumRows*numCols,MPI_INT, matrixGlobal, elementCounts, elementDispls,MPI_INT,MPI_COMM_WORLD );


And finally an example shows how to gather distributed at the root process:

  int *matrixGlobal = new int[numRows*numCols];
  int *matrixLocal = new int[localNumRows*numCols];
  int *elementCounts = ... // number of elements on each thread
  int *elementDispls = ... // global offset of first element on each process
  MPI_Gatherv(matrixLocal, elementCounts[rankID], MPI_INT, matrixGlobal, elementCounts, elementDispls, MPI_INT, 0, MPI_COMM_WORLD);

MPI program structure

  • by executing the program it is transferred to all processes that are involved. The program is then executed on each process on its own.
  • All MPI Functions (Inter Process Communication later referred to as IPC) must occur between initialization and finalization. Those two function calls usually enclose most of the programs code. There is no need in repeatedly using initialization's and finalizations for defining IPC enabled sections.

A typical MPI program structure is as follows:

mpi_example.c :

	#include <mpi.h>
	...
	int main(int *argc, char *argv[]) {	
		....
		MPI_Init(&argc, &argv);        //initialization
		/* MPI functions 
                  only in between */
		MPI_Finalize();                //finalization
		....
	}

Sample Programs

Simple hello world!
Hello world with send and receive

Features

Advantages

  • own data types (can be defined for using any data structure with MPI functions)
  • own Reduce functions
  • plenty of communication functions especially in MPI-2 (of course, this is all about message passing)
  • nearly infinite computation power through use of networked computers (there may be restrictions by the MPI implementation)
  • good use of cache through local memory (ideally one process/cpu/cache)
  • MPI is a standard not a system (not proprietary => many implementations for many systems)
  • explicit parallel programming – means explicit use of communication functions (lots available, simple and complex ones) the right way.

Disadvantages

  • declaration of own data types can be tricky
  • having to deal explicitly with message passing (all communication and calculations (displacements (offset), send- and receive-counts, memory allocation))
  • huge differences between serial and MPI implementation of the same program
  • requires a special runtime environment (consequence of inter-computer-communication)
  • no dynamic spawning of processes
  • processes can only access local memory directly, unlike e.g.: OpenMP (certainly only concerns those processes running on the same machine). Of course this could be corrected by using MPI and OpenMP concurrently.
  • as a standard you never know the way and how efficient communication is implemented
  • difficult to debug (mostly commercial tools found), there used to be a visualizer for LAM MPI for Linux
  • explicit parallel programming can be tricky and takes time to learn



Cowichan Problems

The Cowichan Problems were invented from Mr. Wilson to provide an objective basis for comparing different existing parallel programming systems. A sequential version of all the problems was coded by Mr. Wilson in ANSI C. He wrote serial and parallel code (using pthreads). His programs were used as a reference implementation for checking that the parallel versions produced the correct results. Information on the Cowichan-Problems can be found here.


Distributed data structures

In many MPI programs matrices and vectors must be distributed among the processes. The standard methods to distribute, scatter and gather data in MPI are collective communication functions. These are needed if more than two processes are involved in communication. These need information how to split the data, where and how much to send, where and how much to receive and which processes are involved. These can be easily computed if data is distributed evenly among all processes. However, if data needs to distributed in more complex ways (e.g.: cyclic borders) more complex calculations are needed. For this case a generic object orientated data structure class was invented. This class can hold a matrix of any type. It holds calculates and stores information how it is to be distributed as well.


The following shows the main structure of the class:


  template<class T>
  class Distribution {
    private:
      int numThreads; // the number of all threads
      int *elementCounts, *elementDispls; // the displacement and counts
    public:
      T *data;
      T& operator[] (int element) {
         return data[element]; 
      }
      // extra methods needed for calculating elementCounts,...
      // plus some getters and setters
  }


Cowichan Problems

Random Number Generation

Generates and fills a matrix with random numbers. The problem of the parallel implementation is that each successive field in the matrix is computed from the previous field. That implies that the data has a sequential dependency. The output must be repeatable and independent from the numbers of processes. Each process calculates a fraction of the matrix. As the data has a sequential dependency each process needs to have an initial value to start with. This initial seed of random values is computed before the actual computation. Then each process computes the random numbers and stores them in a local matrix. The process local matrices are then merged at the root process. Our implementation looses quite some time doing the merging of the matrices at the end. This is due to the fact that the implementation of Mr. Wilson (inventor of the Cowichan Problems) has the worker generated random numbers intermixed. As he only did a serial and a shared memory implementation he was never addressed with the problem of merging matrices intermixed. Doing these array operations takes time but computes the exact same output as the reference implementations. To speedup the process, the random numbers can be merged using a Gatherv for vectors. The numbers then will still be correct just having a different sequence.

Image Thresholding

Takes an integer image and performs an image threshold. Outputs a binary image that shows which pixels are brighter and which are darker than a given percentage.

Each process holds a local fraction of the distributed image. After the highest values of the image fractions are computed a reduction is used to find the highest value at all. Then a local histogram computation is performed. After a vector reduction, the root process holds a complete histogram. The threshold brightness is then computed and broadcasted. A parallel computation of the binary matrix follows.

Invasion percolation

Invasion percolation simulates the flow of a liquid over a solid ground. The input is a matrix of integers representing rock densities. If the liquid reaches another field it is marked. During each iteration all neighbors of all filled cells are examined and the one with the lowest value is marked. Apparently this is a step by step iterative process what makes it hard to parallelize.

We had three ideas to implement this module.

1. The naïve implementation splits the matrix among all processes. Each process searches in its local matrix for the lowest neighboring field. The results are collected at the root process. The smallest field is broadcasted. If the field is part of a process’s fraction of the matrix it is marked. A very slow solution, since only one process at a time is allowed to work. It is still a serial implementation using multiple processes waiting for each other. Furthermore a real serial implementation uses a priority queue. This queue holds the next possible fields to mark. Each iteration the top field of the queue is marked and its neighbors added. There are still some things to check but we don’t want to discuss this in depth here.

2. An alternative approach is to use a parallel priority queue. If a new field is marked, its neighbors are added to the queue. The process to add a field to the queue takes longer every time a new field is added. More speed is gained the larger the number of elements in the queue. Also the second approach seems very promising we were not able to implement it. After all it is the one idea that has the potential to give real speedups. Although this would only happen if there are an enormous amount of elements already in the priority queue. It would be smart to run the program in serial mode and if a certain threshold of elements in the queue is met, enqueueing is parallelized.

3. Another idea is to let threads run ahead. This means going the most likely way.

Game of Life

This program simulates Conway's "Game of Life" a 2D cellular automaton. At each time step, this module must count the number of live (true) neighbors of each cell. Neighbors are orthogonal and diagonal fields, using circular boundaries. The update rule is simple. If a cell has 3 or 2 live neighbors and is already alive, it is alive in the next generation. In any other situation, the cell becomes or stays dead. After each iteration every thread receives updated borders from its neighbor threads (upper and lower borders).

For realizing the boundary communication MPI offer virtual topologies. But virtual topologies are as powerful as they are complicated. Because we found neither a code example nor adequate explanations we choose to implement the boundary communications ourselves.

Like in the Game of Life there is a special case of less than three points are assigned on a process. In this case it is not defined which is the data to calculate and what are the borders.


Elastic Net (TSP)

This module uses the elastic net algorithm to find an approximate shortest way trough a given number of cities (known as travelling salesman problem TSP). The vector of cities is broadcasted to all processes. The closed, elastic, circular loop of connected net points is computed locally. For annealing the net points towards the cities a delta vector (size of net points) is computed parallel using all net points but a fraction of cities. This means every process calculates a fraction of each delta value. Those are reduced using addition. The net points are broadcasted after the deltas have been applied to them.

Another approach is to compute the delta vector using all cities but a fraction of net points. This means every process computes a fraction of the delta vector. The delta vector is then merged at the root process using gather. The net points are broadcasted after the deltas have been applied to them. Again if you’re not using a MPI virtual topology then the needed boundary communication is not a trivial thing. Like in the Game of Life there is a special case of less than three points are assigned on a process. In this case it is not defined which is the data to calculate and what are the borders.

Outer Product

Takes a vector of points to create a dense, symmetric, diagonally dominant matrix by calculating distances from each point to every other point. Furthermore calculates a vector of point to origin distances. Every process holds a local copy of the vector of points. The vector calculation is performed by every process on a local fraction of the distributed vector.

The matrix calculation is more difficult:

1. The simplest way is that each process calculates a complete fraction of the solution matrix. The matrix is later sent to the root process and concatenated. This does twice as much calculation as necessary because the matrix splits into 2 identical triangles.

2. The second idea takes advantage of the symmetric structure of the matrix and only calculates one of the symmetric triangles. Either those calculated values are send to the appropriate processes. This way each process holds a complete version of its local fraction of the solution that is send to the root process. This approach saves time not calculating any symmetric values but uses plenty of communication to achieve it.

3. Or the calculated triangle is send to the root process which takes care of mirroring the triangle. This approach saves time in comparison to the second but again loses it when the root process does the mirroring all on its own.

Although the third approach seemed most promising it grew slower than the simple first one. This is due to the time needed for array copy operations of the root process which where too slow. The second one makes so extensive use of communication that the time for sending data exceeds the time for calculating it. The first approach is the fastest. As memory and communication operations grow faster every day the second and third idea shall become interesting in time.

Keep in mind that shared memory parallel systems like OpenMP dont have the adressed problems. A version making use of the symmetric nature works well.

Successive Overrelaxation

Solves a matrix equation AX = V using the iterative successive overrelaxation approach. A is a dense, symmetric, diagonally dominant matrix. V an arbitrary non-zero vector. X is the solution vector.

During each iteration every element of the solution vector is relaxed (converged) towards a more accurate solution. Then the solution vector is updated among all processes. This goes on either as long as there is enough change to the preceding value or maximum number of iterations is reached.

There is also the possibility of updating the new values of the solution vector more often. This can lead to a faster convergence, hence less iterations needed. But the communication overhead produced exceeds the benefits of faster convergence.

Speedups

Speed-ups have been measured on five systems. For clarity here only posted for the hrz-cs459 Opteron System. The parallel programs have been measured against a 1 process version. So bear in mind that these are no real speedups.

You can download a 40 page speedup document. It contains speedup measurements for MPI and OpenMP implementations using all available computer systems.

PDF document with all speeupd measurements


There are mainly speeddowns using a small 128x128 matrix. This is due to the overhead of communication used. There is another run with larger matrics below. Image:MPI_speedup_table_128.png

As you can see with larger matrices, speedups are gained. In this case the computation takes more time in contrast to the communication. The programs are oriented for larger input files. You may be irritated that there are no timings for Inversion Percolation, that is because we are still waiting for the results. :) Image:MPI_speedup_table.jpg

Conclusion

MPI

MPI is a very powerful tool for parallel computing. It is quite complex and offers the possibilities to do things the way on likes it. There is a wide range of available communication methods, the possibility of using own data types, topologies,… and even more with MPI-2. As much as this standard is complex and powerful it is although difficult (that’s not a bad thing, it’s just a consequence). It’s a large difference explicitly programming parallel with MPI or using convenience functions for parallelising in OpenMP.

Communication happens through message passing. This addresses another problem. MPI is not a standard designed for extensive use of communication. Communication takes time. This is due to the slow communication speed and latency, especially over networks. But the possibility to make use of networked computers opens up enormous computational power. This leads o the conclusion that MPI is destined for heavy computations, but not heavy Inter Process Communication

Cowichan Problems

The Cowichan Problems were invented to provide an objective basis for comparing different existing parallel programming systems. On the one hand parallelizing them was difficult sometimes, on the other hand sometimes no speed ups could be measured. This is not due to a badly designed parallel system. There are some problems far from being parallelizable. The Cowichan Problems face this problem only on a partly basis. But they were invented when cpu and memory speed were equal. During the last years cpu’s gained speed while memory access times stayed the same. That is why they provided Mr. Cowichan with good speedups. His programs did not parallelize only the calculations, but also distributed the memory access time among all participating workers. As cpu speed exceeds memory speed this gain has become unavailable. The processes are more or less waiting for memory access. Maybe the Cowichan Problems have to be considered outdated.

MPI Tools

Many tools like debuggers, performance visualization tools are available.
MPImap graphically displays the structure of user-defined MPI data types.
MPI Check is an MPI program debugger for Fortran90 and Fortran77.
mpC is a programming environment for a distributed memory system. It explicitly defines an abstract network and distributes data, computations and communications over the network.
mpptest is a program for measuring MPI Message Passing Routines in a variety of situations. mpptest can measure performance with many participating processes.
PALM_Research is the SPMD version of the PALM software. PALM software's purpose is to handle complex applications in a modular and parallel way to ensure their evolutivity and guarantee their performance.
PALM_MP is the MPMD (Multiple Programs Multiple Data) version of the PALM software.
Etnus: A debugger for Linux and Unix applications. It supports MPI and OpenMP.
Paraver : Visualization environment for MPI, OpenMP, Mixed, MLP, Java. Some Paraver features are the support for detailed quantitative analysis of program performance, concurrent comparative analysis of several traces, fast analysis of very large traces, mixed message passing and shared memory (networks of SMPs) etc...

Related links

MPI forum
Argonne National Laboratory
MPI Tutorial
Another MPI Tutorial
MPICH
Good German MPI Site!