For many complex problems, a single core is not enough to finish the calculation in a reasonable time.
VASP makes use of parallel machines by splitting the calculation into many tasks that communicate with each other using MPI.
By default, VASP distributes the number of bands ({{TAG|NBANDS}}) over the available MPI ranks.
But it is often beneficial to add parallelization of the FFTs ({{TAG|NCORE}}), parallelization over '''k''' points ({{TAG|KPAR}}), and parallelization over separate calculations ({{TAG|IMAGES}}).
All these tags default to 1 and divide the number of MPI ranks among the parallelization options.
There are also additional parallelization options for some algorithms in VASP.
::<math>
\text{total ranks} = \text{ranks parallelizing bands} \times \text{NCORE} \times \text{KPAR} \times \text{IMAGES} \times \text{other algorithm-dependent tags}
</math>

In addition to the parallelization using MPI, VASP can make use of [[Hybrid_MPI/OpenMP_parallelization|OpenMP-threading]] and/or [[OpenACC_GPU_port_of_VASP|OpenACC (for the GPU-port)]].
Note that running on multiple OpenMP threads and/or GPUs switches off the {{TAG|NCORE}} parallelization.
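As a minimal illustration of how the total number of MPI ranks is divided (all numbers are hypothetical), consider a job launched on 64 ranks with the following settings in the INCAR file:
<pre>
NCORE  = 4   ! 4 ranks share the FFTs of each band
KPAR   = 2   ! 2 groups of k points
IMAGES = 1   ! a single calculation (the default)
</pre>
With these settings, 64 / (4 × 2 × 1) = 8 ranks remain for the parallelization over bands.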
==Optimizing the parallelization==

{{NB|tip|We offer only general advice here. The performance for specific systems may differ significantly. However, in many cases, one is interested in a series of similar calculations. Then, run a few representative cases with different parallel setups and use the optimal choice of parameters for the rest.}}

When optimizing the parallelization, try to stay as close as possible to the actual production setup.
This includes both the physical system (atoms, cell size, cutoff, ...) and the computational hardware (CPUs, interconnect, number of nodes, ...).
If too many parameters are different, the parallel configuration may not be transferable to the production calculation.
Nevertheless, a few steps of the repetitive tasks give a good idea of the optimal choice for the full calculation.
For example, run only a few electronic or ionic self-consistency steps instead of converging the calculation fully, as sketched below.
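A minimal sketch of such a benchmark, assuming a template file INCAR.template without {{TAG|NCORE}}, 64 MPI ranks, and the standard executable vasp_std (all names and numbers are placeholders to be adapted):
<pre>
# Benchmark a few NCORE values with a truncated electronic minimization (NELM = 5).
for ncore in 1 2 4 8; do
    cp INCAR.template INCAR
    echo "NCORE = $ncore" >> INCAR
    echo "NELM  = 5"      >> INCAR
    mpirun -np 64 vasp_std > stdout.ncore_${ncore}
    # The LOOP/LOOP+ entries in OUTCAR report the time per electronic/ionic step.
    grep 'LOOP' OUTCAR > timings.ncore_${ncore}
done
</pre>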
Often, combining multiple parallelization options yields the fastest results because the parallel efficiency of each level drops near its limit.
For the default option (band parallelization), the limit is {{TAG|NBANDS}} divided by a small integer.
Note that VASP will increase {{TAG|NBANDS}} to match the number of ranks.
Choose {{TAG|NCORE}} as a factor of the number of cores per node to avoid communication between nodes for the FFTs.
Recall that running with OpenMP or OpenACC enforces that {{TAG|NCORE}} is not set.
The '''k'''-point parallelization is efficient but requires additional memory.
Given sufficient memory, increase {{TAG|KPAR}} up to the number of irreducible '''k''' points.
Keep in mind that {{TAG|KPAR}} should divide the number of '''k''' points evenly.
Finally, {{TAG|IMAGES}} is required to split several VASP runs into separate calculations.
The limit is dictated by the number of desired calculations.
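For instance, a hypothetical calculation with 8 irreducible '''k''' points on 4 nodes with 32 cores each (128 MPI ranks in total) could combine these tags as follows (adapt the numbers to the actual hardware and '''k'''-point mesh):
<pre>
KPAR  = 4    ! 8 k points / 4 = 2 k points per group; each group spans exactly one node (32 ranks)
NCORE = 8    ! 8 is a factor of the 32 cores per node, so each FFT stays within a node
             ! remaining band parallelization: 128 / (4 * 8) = 4 groups of bands
</pre>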
==Caveat about the MPI setup==

The MPI setup determines the placement of the ranks onto the nodes.
VASP assumes that the ranks first fill up one node before the next node is occupied.
As an example, when running with 8 ranks on two nodes, VASP expects ranks 1–4 on node 1 and ranks 5–8 on node 2.
If the ranks are placed differently, communication between the nodes occurs for every parallel FFT.
Because FFTs are essential to VASP's speed, this degrades the performance of the calculation.
A typical symptom is an increase in computing time when the number of nodes is increased from 1 to 2.
If {{TAG|NCORE}} is not used, this issue is less severe, but the performance will still be reduced.

To address this issue, please check the setup of the MPI library and the submitted job script.
It is usually possible to overwrite the placement by setting environment variables or command-line arguments, as illustrated below.
When in doubt, contact the HPC administration of your machine to investigate the behavior.
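The relevant options depend on the MPI library and the queuing system; for example, with Open MPI a block-wise placement can be requested and verified roughly like this (the exact flags differ between libraries and versions, so treat this as a sketch):
<pre>
# Place 4 consecutive ranks on each node, pin them to cores, and print the resulting bindings.
mpirun -np 8 --map-by ppr:4:node --bind-to core --report-bindings vasp_std
# With Slurm, a similar placement is typically requested with, e.g., srun --ntasks-per-node=4 vasp_std
</pre>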
==Additional parallelization options==

; {{TAG|KPAR}}: For Laplace-transformed MP2, this tag [[LTMP2_-_Tutorial#Parallelization|has a different meaning]].
; {{TAG|NCORE_IN_IMAGE1}}: Defines how many ranks work on the first image in the thermodynamic coupling-constant integration ({{TAG|VCAIMAGES}}).
; {{TAG|NOMEGAPAR}}: Parallelizes over imaginary frequency points in GW and RPA calculations.
; {{TAG|NTAUPAR}}: Parallelizes over imaginary time points in GW and RPA calculations.
==OpenMP/OpenACC==

Both [[Hybrid_MPI/OpenMP_parallelization|OpenMP]] and [[OpenACC_GPU_port_of_VASP|OpenACC]] parallelize the FFTs and therefore disregard any conflicting specification of {{TAG|NCORE}}.
When combining these methods, OpenACC takes precedence, but any code not ported to OpenACC benefits from the additional OpenMP threads.
This approach is relevant because the recommended NVIDIA Collective Communications Library (NCCL) requires a single MPI rank per GPU.
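For example, a node with 4 GPUs and 64 CPU cores might be driven by 4 MPI ranks (one per GPU) with the remaining cores filled by OpenMP threads; the thread count and launch line below are only an assumed sketch to be adapted to the actual machine:
<pre>
export OMP_NUM_THREADS=16   # 64 cores / 4 ranks = 16 OpenMP threads per rank
export OMP_PLACES=cores     # pin the threads to physical cores
export OMP_PROC_BIND=close
mpirun -np 4 vasp_std       # one MPI rank per GPU
</pre>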