OpenMP & MPI

  • Clustered architectures have distributed-memory systems, where each node consists of a traditional shared-memory multiprocessor.

  • Single address space within each node, but separate nodes have separate address spaces.

Development / maintenance

  • In most cases, development and maintenance will be harder than for an MPI code, and much harder than for an OpenMP code.

  • If an MPI code already exists, adding OpenMP may not involve too much extra effort.

  • In some cases, it may be possible to use a simpler MPI implementation because the need for scalability is reduced.

    • e.g. 1-D domain decomposition instead of 2-D

Portability

  • Both OpenMP and MPI are themselves highly portable (but not perfect).

  • Combined MPI/OpenMP is less so

    • main issue is thread safety of MPI

    • if maximum thread safety is assumed, portability will be reduced

  • Desirable to make sure the code functions correctly (maybe with conditional compilation) as a stand-alone MPI code (and as a stand-alone OpenMP code)
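
For example, OpenMP-specific calls can be guarded with the standard _OPENMP macro (defined by every conforming OpenMP compiler), so the same source still builds as a stand-alone MPI code; the helper name below is illustrative:

#ifdef _OPENMP
#include <omp.h>
#endif

/* Illustrative helper: reports the thread count in either build. */
int num_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();   /* OpenMP or mixed-mode build */
#else
    return 1;                       /* stand-alone MPI build */
#endif
}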

Performance

Four possible performance reasons for mixed OpenMP/MPI codes:

  • Replicated data

    • Replicated data strategy

      • all processes have a copy of a major data structure

      • classical domain decomposition codes have replication in halos

      • MPI buffers can consume significant amounts of memory

    • A pure MPI code needs one copy per process/core

    • A mixed code would only require one copy per node

      • data structure can be shared by multiple threads within a process

      • MPI buffers for intra-node messages no longer required

    • Halo regions are a type of replicated data

      • Although the amount of halo data does decrease as the local domain size decreases, it eventually starts to occupy a significant fraction of the storage
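
As a rough illustration (assuming a cubic n × n × n local domain with a halo of width 1, an assumption not stated above), the halo fraction grows like 6/n as the local domain shrinks:

#include <stdio.h>

/* Halo-storage fraction for an n x n x n local domain with a
   halo of width 1 on every face (illustrative arithmetic only). */
int main(void)
{
    for (int n = 64; n >= 4; n /= 2) {
        double total = (double)(n + 2) * (n + 2) * (n + 2);
        double halo  = total - (double)n * n * n;
        printf("n = %2d: halo fraction = %.2f\n", n, halo / total);
    }
    return 0;
}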

  • Poorly scaling MPI codes

    • If the MPI version of the code scales poorly, then a mixed MPI/OpenMP version may scale better.

    • May be true in cases where OpenMP scales better than MPI due to:

      • Algorithmic reasons, e.g. load balancing is easier to achieve in OpenMP.

      • Simplicity reasons, e.g. a 1-D domain decomposition may suffice.

  • Load balancing

    • Load balancing between MPI processes can be hard

      • need to transfer both computational tasks and data from overloaded to underloaded processes

      • transferring small tasks may not be beneficial

      • having a global view of loads may not scale well

      • may need to restrict to transferring loads only between neighbors

    • Load balancing between threads is much easier

      • only need to transfer tasks, not data

      • overheads are lower, so fine grained balancing is possible

      • easier to have a global view

    • For applications with load balance problems, keeping the number of MPI processes small can be an advantage (with more work per process, imbalances are more likely to even out statistically).

  • Mixed-mode implementation of collective operations, e.g. a global reduction (sketched after this list):

    • reduce within a node via the OpenMP reduction clause

    • then reduce across nodes with MPI_Reduce

    • send one large message per node instead of several small ones

    • reduces latency effects, and contention for network injection
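
A minimal sketch of this two-level reduction for a global sum (the array a, its length n, and the function name are illustrative, not from the notes):

#include <mpi.h>

/* Two-level sum: OpenMP reduction within each process, then a single
   MPI_Reduce across processes, i.e. one message per node when running
   one process per node. */
double two_level_sum(const double *a, int n)
{
    double local = 0.0, global = 0.0;

    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += a[i];               /* per-thread partial sums combined here */

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);   /* result lands on rank 0 */
    return global;
}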

Styles of mixed-mode programming

Master-only

  • Definition: all MPI communication takes place in the sequential part of the OpenMP program (no MPI in parallel regions)

  • Advantages

    • simple to write and maintain

    • clear separation between outer (MPI) and inner (OpenMP) levels of parallelism

    • no concerns about synchronising threads before/after sending messages

  • Disadvantages

    • threads other than the master are idle during MPI calls

    • all communicated data passes through the cache where the master thread is executing.

    • inter-process and inter-thread communication do not overlap.

    • the only way to synchronise threads before and after message transfers is to close and reopen parallel regions, which has a relatively high overhead.

    • packing/unpacking of derived data types is sequential.

#pragma omp parallel
{
    work...                  /* all threads compute */
}

/* sequential part: only the master thread executes here */
ierror = MPI_Send(...);

#pragma omp parallel
{
    work...
}

Funneled

  • Definition

    • all MPI communication takes place through the same (master) thread; the MPI calls can be inside parallel regions

  • Advantages

    • relatively simple to write and maintain

    • cheaper ways to synchronise threads before and after message transfers

    • possible for other threads to compute while master is in an MPI call

  • Disadvantages

    • less clear separation between outer (MPI) and inner (OpenMP) levels of parallelism

    • all communicated data still passes through the cache where the master thread is executing.

    • inter-process and inter-thread communication still do not overlap.

#pragma omp parallel
{
    work...
    /* make sure all threads have finished computing */
    #pragma omp barrier
    #pragma omp master
    {
        ierror = MPI_Send(...);   /* only the master thread communicates */
    }
    /* master has no implicit barrier of its own: hold the other
       threads until the message transfer is complete */
    #pragma omp barrier
    work...
}

Serialized

  • Definition

    • only one thread makes MPI calls at any one time (which thread that is may vary from call to call)

    • distinguish sending/receiving threads via MPI tags or communicators

    • be very careful about race conditions on send/recv buffers etc.

  • Advantages

    • easier for other threads to compute while one is in an MPI call

    • can arrange for threads to communicate only their “own” data (i.e. the data they read and write).

  • Disadvantages

    • getting harder to write/maintain

    • more, smaller messages are sent, incurring additional latency overheads

    • need to use tags or communicators to distinguish between messages from or to different threads in the same MPI process.

#pragma omp parallel
{
    work...
    #pragma omp critical     /* at most one thread in an MPI call at a time */
    {
        ierror = MPI_Send(...);
    }
    work...
}
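
One common convention (an assumption here, not something MPI mandates) is to use the OpenMP thread number as the message tag, so that thread i on the sender pairs with thread i on the receiver; buf, count, and dest are illustrative names:

#include <mpi.h>
#include <omp.h>

/* Serialized sends: each thread sends its own slice of buf, one
   thread at a time, tagged with its thread number. */
void send_per_thread(const double *buf, int count, int dest)
{
    #pragma omp parallel
    {
        int tag = omp_get_thread_num();
        #pragma omp critical
        MPI_Send(buf + (long)tag * count, count, MPI_DOUBLE,
                 dest, tag, MPI_COMM_WORLD);
    }
}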

Multiple

  • Definition

    • MPI communication simultaneously in more than one thread

    • some MPI implementations don’t support this

    • and those which do mostly don’t perform well

  • Advantages

    • Messages from different threads can (in theory) overlap

      • in practice, many MPI implementations serialise them internally

    • Natural for threads to communicate only their “own” data

    • Fewer concerns about synchronising threads (responsibility passed to the MPI library)

  • Disadvantages

    • Hard to write/maintain

    • Not all MPI implementations support this – loss of portability

    • Most MPI implementations don’t perform well like this

    • Thread safety is often implemented crudely, e.g. with a global lock.

#pragma omp parallel
{
    work...
    ierror = MPI_Send(...);   /* any thread may call MPI, concurrently */
    work...
}

MPI execution environment

MPI_Init_thread

  • works in a similar way to MPI_Init, initializing MPI; the calling thread becomes the main thread.

  • Two integer arguments:

    • Required ([in] Level of desired thread support)

    • Provided ([out] Level of provided thread support)

  • Levels of thread support (descriptions from MPICH):

    • MPI_THREAD_SINGLE

      • Only one thread will execute.

    • MPI_THREAD_FUNNELED

      • The process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are funneled to the main thread).

    • MPI_THREAD_SERIALIZED

      • The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are serialized).

    • MPI_THREAD_MULTIPLE

      • Multiple threads may call MPI, with no restrictions.
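
A minimal start-up sketch requesting MPI_THREAD_FUNNELED (the error handling is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int required = MPI_THREAD_FUNNELED, provided;

    MPI_Init_thread(&argc, &argv, required, &provided);

    /* the library may provide less than was requested */
    if (provided < required) {
        fprintf(stderr, "thread support too low: got %d, need %d\n",
                provided, required);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... mixed-mode work ... */

    MPI_Finalize();
    return 0;
}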

MPI_Query_thread()

  • returns the current level of thread support

  • int MPI_Query_thread(int *provided)

  • Need to compare the output manually:

int provided;
int requested = MPI_THREAD_FUNNELED;   /* level this code needs (example) */
...
MPI_Query_thread(&provided);
if (provided < requested) {
    printf("Not a high enough level of thread support!\n");
    MPI_Abort(MPI_COMM_WORLD, 1);      /* does not return */
}
