
4.4.5. OpenMP Parallelism

The styles in the INTEL, KOKKOS, and OPENMP packages can use OpenMP thread parallelism, predominantly to distribute loops over local data, and thus follow a parallelization strategy that is orthogonal to the decomposition into spatial domains used by the MPI partitioning. For clarity, this section discusses only the implementation in the OPENMP package, which is the simplest of the three. The INTEL and KOKKOS packages offer additional options and are more complex, since they support more features and different hardware such as co-processors or GPUs.

One of the key decisions when implementing the OPENMP package was to keep the changes to the source code small, so that the code is easier to maintain and to keep in sync with the non-threaded standard implementation. This is achieved by a) making the OPENMP version a class derived from the regular version (e.g. PairLJCutOMP from PairLJCut) and overriding only those methods that are multi-threaded or need modifications to support multi-threading (similar to what was done in the OPT package), b) keeping the structure of the modified code very similar, so that side-by-side comparisons remain useful, and c) moving additional functionality and multi-thread support functions into three separate classes: ThrOMP, ThrData, and FixOMP.

ThrOMP provides additional, multi-thread aware functionality that is not available in the corresponding base class (e.g. Pair for PairLJCutOMP), such as multi-thread aware variants of the “tally” functions. These functions are made available through multiple inheritance and therefore must have unique names to avoid ambiguities; typically _thr is appended to the function name. ThrData manages per-thread data structures. It is used instead of extending the corresponding storage to per-thread arrays in order to avoid slowdowns due to “false sharing”, which occurs when multiple threads update adjacent elements of an array and thus force CPU cache lines to be invalidated and re-fetched. Finally, FixOMP manages the “multi-thread state”, such as settings and access to the per-thread storage; it is activated by the package omp command.
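The following is a minimal, self-contained sketch of that class layout. The class names match those mentioned above, but the members, signatures, and bodies are purely illustrative and far simpler than the actual LAMMPS sources; the sketch only demonstrates the derived-class-plus-multiple-inheritance pattern and the _thr naming convention.

    #include <cstdio>

    class Pair {                       // stands in for the regular base class
    public:
      virtual ~Pair() = default;
      virtual void compute() { std::printf("serial compute\n"); }
    protected:
      void ev_tally(double evdwl) { eng_vdwl += evdwl; }   // serial tally
      double eng_vdwl = 0.0;
    };

    class PairLJCut : public Pair {    // regular, non-threaded pair style
    public:
      void compute() override { ev_tally(1.0); }
    };

    class ThrOMP {                     // extra multi-thread aware helpers
    protected:
      // unique name (_thr suffix) avoids ambiguity with Pair::ev_tally
      void ev_tally_thr(Pair *pair, double evdwl, int tid) {
        // a real implementation would accumulate into per-thread storage
        // (ThrData) and reduce later; omitted here for brevity
        (void) pair; (void) evdwl; (void) tid;
      }
    };

    class PairLJCutOMP : public PairLJCut, protected ThrOMP {
    public:
      void compute() override {        // only the threaded parts are overridden
        std::printf("threaded compute\n");
        ev_tally_thr(this, 1.0, 0);
      }
    };

    int main() {
      PairLJCutOMP p;
      p.compute();
      return 0;
    }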

Avoiding data races

A key problem when implementing thread parallelism in an MD code is to avoid data races when updating accumulated properties like forces, energies, and stresses. Computed interactions always involve multiple atoms, so race conditions arise when multiple threads want to update per-atom data of the same atom. Five possible strategies were considered to avoid this:

  1. Restructure the code so that there is no overlapping access possible when computing in parallel, e.g. by breaking lists into multiple parts and synchronizing threads in between.

  2. Have each thread be “responsible” for a specific group of atoms, compute each interaction redundantly on every thread that is responsible for one of the atoms involved, and then have each thread update only the properties of its own atoms.

  3. Use mutexes around functions and regions of code where the data race could happen.

  4. Use atomic operations when updating per-atom properties.

  5. Use replicated per-thread data structures to accumulate data without conflicts and then use a reduction to combine those results into the data structures used by the regular style.

Option 5 was chosen for the OPENMP package because it retains the performance of the single-thread case and keeps the code more maintainable. Option 1 would require extensive code changes, particularly to the neighbor list code; option 2 would have incurred a 2x or more performance penalty for the serial case; option 3 causes significant overhead and would enforce serialization of operations in inner loops and thus defeat the purpose of multi-threading; option 4 also slows down the serial case, although not as badly as option 2. The downside of option 5 is that the overhead of the reduction operations grows with the number of threads, so there is a crossover point beyond which options 2 or 4 would result in faster execution. This is why option 2, for example, is used in the GPU package, since a GPU is a processor with a massive number of threads. However, since the MPI parallelization is generally more effective for typical MD systems, the expectation is that thread parallelism is only used with a smaller number of threads (2-8). At the time of the implementation, that was equivalent to the number of CPU cores per CPU socket on high-end supercomputers.
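For illustration, the sketch below shows what option 4 would look like for per-atom force accumulation (hypothetical names, forces reduced to a single component per atom). The atomic read-modify-write operations avoid the data race, but they are executed even when running with only one thread, which is the origin of the serial slowdown mentioned above.

    #include <vector>

    struct PairIJ { int i, j; double fpair; };   // one precomputed pair force

    void accumulate_atomic(const std::vector<PairIJ> &pairs, std::vector<double> &f)
    {
      #pragma omp parallel for
      for (long n = 0; n < (long) pairs.size(); ++n) {
        #pragma omp atomic
        f[pairs[n].i] += pairs[n].fpair;   // atomic update of atom i
        #pragma omp atomic
        f[pairs[n].j] -= pairs[n].fpair;   // and of the Newton "reaction" partner j
      }
    }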

With option 5, arrays like the force array are dimensioned to the number of atoms times the number of threads when OpenMP support is enabled, and inside the compute functions each thread obtains a pointer to a different chunk of that storage. Similarly, accumulators like the potential energy or the virial are kept in per-thread instances of the ThrData class and only reduced and stored in their global counterparts at the end of the force computation.
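A simplified sketch of this scheme is shown below (hypothetical function name, a single force component per atom, and plain arrays instead of the ThrData/FixOMP machinery): each thread accumulates into its private chunk without synchronization, and the chunks are reduced afterwards.

    #include <omp.h>
    #include <array>
    #include <cstddef>
    #include <vector>

    // "f" holds nthreads chunks of nall entries each; every thread writes only
    // into its own chunk, and all chunks are folded back into chunk 0 at the end
    void accumulate_reduced(int nall, const std::vector<std::array<int,2>> &pairs,
                            const std::vector<double> &fpair, std::vector<double> &f)
    {
      const int nthreads = omp_get_max_threads();
      f.assign((std::size_t) nall * nthreads, 0.0);   // natoms x nthreads storage

      #pragma omp parallel
      {
        const int tid = omp_get_thread_num();
        double *ft = f.data() + (std::size_t) tid * nall;   // this thread's chunk

        #pragma omp for
        for (long n = 0; n < (long) pairs.size(); ++n) {
          ft[pairs[n][0]] += fpair[n];    // race free: private chunk
          ft[pairs[n][1]] -= fpair[n];
        }
        // the implicit barrier of the loop above guarantees all chunks are
        // complete; now reduce them into chunk 0, again split across threads
        #pragma omp for
        for (int i = 0; i < nall; ++i)
          for (int t = 1; t < nthreads; ++t)
            f[i] += f[(std::size_t) t * nall + i];
      }
    }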

Loop scheduling

Multi-thread parallelization is applied by distributing (outer) loops statically across threads. Typically, this is the loop over local atoms i when processing i,j pairs of atoms from a neighbor list. For homogeneous systems, the design of the neighbor list code results in atoms having a similar number of neighbors, so load imbalances across threads are not common; they typically occur for systems where the MPI parallelization would also be unbalanced, which usually has a more pronounced impact on performance. The same loop scheduling scheme can also be applied to the reduction operations on per-atom data to reduce the overhead of the reduction.
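A sketch of such a static split is shown below. The helper name loop_split is made up for illustration; the OPENMP package uses its own equivalent to give each thread a contiguous range of neighbor list entries, and the same partition can be reused for the per-atom reduction.

    #include <omp.h>

    // compute this thread's [ifrom, ito) share of the outer loop over the
    // "inum" neighbor list entries; the split is static, so it can be reused
    // for the reduction of per-atom data handled by the same thread
    static void loop_split(int inum, int tid, int nthreads, int &ifrom, int &ito)
    {
      ifrom = (int) ((long long) inum * tid / nthreads);
      ito   = (int) ((long long) inum * (tid + 1) / nthreads);
    }

    void pair_compute(int inum)
    {
      #pragma omp parallel
      {
        int ifrom, ito;
        loop_split(inum, omp_get_thread_num(), omp_get_num_threads(), ifrom, ito);
        for (int ii = ifrom; ii < ito; ++ii) {
          // ... process neighbor list entry ii: atom i and its neighbors j ...
        }
      }
    }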

Neighbor list parallelization

In addition to the force computations, the generation of the neighbor lists is also parallelized. As explained previously, neighbor lists are built by looping over “owned” atoms and storing their neighbors in “pages”. In the OPENMP variants of the neighbor list code, each thread operates on a different chunk of the “owned” atoms and allocates and fills its own set of pages with neighbor list data. This is achieved by giving each thread its own instance of the MyPage page allocator class.
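The sketch below illustrates the idea with a simplified stand-in for the MyPage allocator (made-up names and sizes): each thread processes its own contiguous chunk of the owned atoms and stores neighbor indices in its own, privately allocated pages.

    #include <omp.h>
    #include <cstddef>
    #include <vector>

    // minimal stand-in for a paged allocator: hands out space for neighbor
    // indices from a growing set of fixed-size pages
    struct NeighPages {
      static constexpr std::size_t pagesize = 1024;
      std::vector<std::vector<int>> pages;
      std::size_t used = pagesize;      // forces allocation of the first page

      int *get(std::size_t n) {         // reserve room for n neighbor indices
        if (used + n > pagesize) { pages.emplace_back(pagesize); used = 0; }
        int *ptr = pages.back().data() + used;
        used += n;
        return ptr;
      }
    };

    void build_lists(int nlocal)
    {
      #pragma omp parallel
      {
        NeighPages mypage;              // one private set of pages per thread
        const int tid = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
        const int ifrom = (int) ((long long) nlocal * tid / nthreads);
        const int ito   = (int) ((long long) nlocal * (tid + 1) / nthreads);

        for (int i = ifrom; i < ito; ++i) {
          int *neighs = mypage.get(64);   // space for the neighbors of atom i
          (void) neighs;                  // ... find and store neighbors of i ...
        }
      }
    }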