Vasp crashes on one specific GPU

Problems running VASP: crashes, internal errors, "wrong" results.

ian_steegmayer
Newbie
Posts: 1
Joined: Mon Jan 15, 2024 2:38 pm

#1 Post by ian_steegmayer » Wed Oct 23, 2024 11:25 am

I have compiled VASP 6.4.3 with GPU support on Ubuntu Server 24.04 without any errors. We have four identical Nvidia A100 GPUs in our system and use Slurm to manage resources. Whenever a job attempts to use GPU 2, it crashes immediately. The crash does not depend on the job; so far, all jobs run perfectly fine on GPUs 0, 1, and 3. We only run jobs on single GPUs, and the test suite ran successfully (using GPU 0). All GPUs work perfectly fine for other tasks, e.g., training neural networks for interatomic potentials. What could cause these crashes?

This is the stdout of a crashed job:

 running    1 mpi-ranks, with    2 threads/rank, on    1 nodes
 distrk:  each k-point on    1 cores,    1 groups
 distr:  one band on    1 cores,    1 groups
 OpenACC runtime initialized ...    1 GPUs detected

And this is the stderr message:

 -----------------------------------------------------------------------------
|                     _     ____    _    _    _____     _                     |
|                    | |   |  _ \  | |  | |  / ____|   | |                    |
|                    | |   | |_) | | |  | | | |  __    | |                    |
|                    |_|   |  _ <  | |  | | | | |_ |   |_|                    |
|                     _    | |_) | | |__| | | |__| |    _                     |
|                    (_)   |____/   \____/   \_____|   (_)                    |
|                                                                             |
|     internal error in: mpi.F  at line: 903                                  |
|                                                                             |
|     M_init_nccl: Error in ncclCommInitRank                                  |
|                                                                             |
|     If you are not a developer, you should not encounter this problem.      |
|     Please submit a bug report.                                             |
|                                                                             |
 -----------------------------------------------------------------------------

Warning: ieee_inexact is signaling
    1
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[55828,1],0]
  Exit code:    1
--------------------------------------------------------------------------

jonathan_lahnsteiner2
Global Moderator
Posts: 215
Joined: Fri Jul 01, 2022 2:17 pm

Re: Vasp crashes on one specific GPU

#2 Post by jonathan_lahnsteiner2 » Wed Oct 23, 2024 3:01 pm

Dear Ian Steegmayer,

This does seem strange. You could try recompiling VASP without NCCL to see whether the error goes away. You can do this by removing -DUSENCCL from your makefile.include. This is likely to solve the problem because the error occurs during NCCL initialization.

Otherwise, if you would like to analyze the problem further, you could set the environment variable NCCL_DEBUG to WARN or TRACE, rerun VASP without recompiling, and check whether the output gives further insight into the problem. Another option would be to print the ncclRes variable in M_init_nccl to see which error code it returns, but that would require recompiling the code, so the first option, removing -DUSENCCL from makefile.include, is perhaps simpler.
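
For readers who want to try the NCCL_DEBUG suggestion, here is a minimal sketch of how the rerun could be scripted. It is not part of VASP or NCCL. The helper names (nccl_debug_env, vasp_command, rerun_with_nccl_debug), the binary name vasp_std, and the mpirun invocation are assumptions chosen for illustration. Under Slurm the GPU is normally selected by the scheduler, so setting CUDA_VISIBLE_DEVICES by hand is only an assumption for an interactive test outside a job allocation.

```python
import os
import subprocess

# Debug levels documented for the NCCL_DEBUG environment variable.
NCCL_DEBUG_LEVELS = {"VERSION", "WARN", "INFO", "TRACE"}


def nccl_debug_env(level="WARN", gpu_index=None, base_env=None):
    """Return a copy of the environment with NCCL_DEBUG set.

    gpu_index is optional; if given, CUDA_VISIBLE_DEVICES is set so that
    only that GPU is visible (assumption: you are testing outside Slurm).
    """
    if level not in NCCL_DEBUG_LEVELS:
        raise ValueError(f"unknown NCCL_DEBUG level: {level!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    if gpu_index is not None:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def vasp_command(ranks=1, binary="vasp_std"):
    """Build a hypothetical launch command; adapt it to your installation."""
    return ["mpirun", "-np", str(ranks), binary]


def rerun_with_nccl_debug(level="WARN", gpu_index=None):
    """Run VASP in the current directory and return (exit code, stdout, stderr)."""
    result = subprocess.run(
        vasp_command(),
        env=nccl_debug_env(level, gpu_index),
        capture_output=True,
        text=True,
        check=False,  # a crash is expected here; we want its output
    )
    return result.returncode, result.stdout, result.stderr
```

Calling rerun_with_nccl_debug("WARN", gpu_index=2) from a directory containing the usual VASP input files would then return the exit code together with stdout and stderr, where any additional NCCL messages should appear next to the M_init_nccl error.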

All the best,
Jonathan
