error during GPU calculations

Posted: Tue May 14, 2024 10:42 am
by sergey_lisenkov1
Hello,

I was running phonon calculations on GPUs with a hybrid functional when this error appeared after the first step:

--------------------------------------------------------------------------
 running    6 mpi-ranks, on    1 nodes
 distrk:  each k-point on    3 cores,    2 groups
 distr:  one band on    1 cores,    3 groups
 OpenACC runtime initialized ...    6 GPUs detected
 vasp.6.4.3 19Mar24 (build May 13 2024 20:23:18) complex                         
 POSCAR found type information on POSCAR CsGeBr
 POSCAR found :  3 types and       5 ions
 scaLAPACK will be used selectively (only on CPU)
-----------------------------------------------------------------------------
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|     For optimal performance we recommend to set                             |
|       NCORE = 2 up to number-of-cores-per-socket                            |
|     NCORE specifies how many cores store one orbital (NPAR=cpu/NCORE).      |
|     This setting can greatly improve the performance of VASP for DFT.       |
|     The default, NCORE=1 might be grossly inefficient on modern             |
|     multi-core architectures or massively parallel machines. Do your        |
|     own testing! More info at https://www.vasp.at/wiki/index.php/NCORE      |
|     Unfortunately you need to use the default for GW and RPA                |
|     calculations (for HF NCORE is supported but not extensively tested      |
|     yet).                                                                   |
|                                                                             |
 -----------------------------------------------------------------------------

 LDA part: xc-table for (Slater+PW92), standard interpolation
 -----------------------------------------------------------------------------
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|     The requested file  could not be found or opened for reading            |
|     k-point information. Automatic k-point generation is used as a          |
|     fallback, which may lead to unwanted results.                           |
|                                                                             |
 -----------------------------------------------------------------------------

 found WAVECAR, reading the header
  number of bands has changed, file:    28 present:    27
  trying to continue reading WAVECAR, but it might fail
  number of k-points has changed, file:    10 present:    20
  trying to continue reading WAVECAR, but it might fail
 WAVECAR: different cutoff or change in lattice found
 POSCAR, INCAR and KPOINTS ok, starting setup
 FFT: planning ... GRIDC
 FFT: planning ... GRID_SOFT
 FFT: planning ... GRID
[tra011:3061229] 5 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[tra011:3061229] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
 reading WAVECAR
 the WAVECAR file was read successfully
 WARNING: dimensions on CHGCAR file are different
 entering main loop
       N       E                     dE             d eps       ncg     rms          ort
DAV:   1    -0.349869116478E+04   -0.34987E+04   -0.12385E+02   540   0.605E+01
DAV:   2    -0.350002263637E+04   -0.13315E+01   -0.13304E+01   540   0.205E+01
DAV:   3    -0.350016885813E+04   -0.14622E+00   -0.14613E+00   540   0.645E+00
DAV:   4    -0.350019612166E+04   -0.27264E-01   -0.27252E-01   540   0.247E+00
DAV:   5    -0.350020228109E+04   -0.61594E-02   -0.61572E-02   540   0.111E+00
DAV:   6    -0.350020381351E+04   -0.15324E-02   -0.15320E-02   540   0.522E-01
DAV:   7    -0.350020421819E+04   -0.40468E-03   -0.40458E-03   540   0.256E-01
DAV:   8    -0.350020432980E+04   -0.11161E-03   -0.11159E-03   540   0.130E-01
DAV:   9    -0.350020436164E+04   -0.31838E-04   -0.31830E-04   540   0.667E-02
DAV:  10    -0.350020437098E+04   -0.93388E-05   -0.93388E-05   540   0.354E-02
DAV:  11    -0.350020437379E+04   -0.28097E-05   -0.28086E-05   540   0.187E-02
DAV:  12    -0.350020437465E+04   -0.86426E-06   -0.86454E-06   540   0.102E-02
DAV:  13    -0.350020437492E+04   -0.27245E-06   -0.27224E-06   540   0.556E-03
DAV:  14    -0.350020437501E+04   -0.86348E-07   -0.86445E-07   540   0.317E-03
DAV:  15    -0.350020437504E+04   -0.27110E-07   -0.27057E-07   540   0.182E-03
DAV:  16    -0.193978720981E+02    0.34808E+04   -0.12621E+01   540   0.292E+01
 gam= 0.000 g(H,U,f)=  0.543E+01 0.701E+00 0.497E-41 ort(H,U,f) = 0.000E+00 0.000E+00 0.000E+00
SDA:  17    -0.174905681945E+02    0.19073E+01   -0.18396E+01   540   0.613E+01 0.000E+00
 gam= 0.442 g(H,U,f)=  0.487E+00 0.252E+00 0.179E-12 ort(H,U,f) = 0.134E+01 0.432E+00-0.617E-13
DMP:  18    -0.186786051067E+02   -0.11880E+01   -0.45624E+00   540   0.739E+00 0.177E+01
 gam= 0.442 g(H,U,f)=  0.521E+00 0.450E-01 0.899E-27 ort(H,U,f) =-0.858E+00 0.189E+00 0.530E-27
DMP:  19    -0.188065714793E+02   -0.12797E+00   -0.81052E-01   540   0.566E+00-0.669E+00
 gam= 0.442 g(H,U,f)=  0.444E+00 0.962E-02 0.119E-25 ort(H,U,f) =-0.251E-01 0.594E-01 0.306E-25
DMP:  20    -0.188522281134E+02   -0.45657E-01   -0.14074E+00   540   0.454E+00 0.343E-01
 gam= 0.442 g(H,U,f)=  0.103E+00 0.788E-02 0.150E-46 ort(H,U,f) = 0.132E+00 0.327E-01-0.221E-45
DMP:  21    -0.189473877130E+02   -0.95160E-01   -0.55176E-01   540   0.111E+00 0.165E+00
 gam= 0.442 g(H,U,f)=  0.394E-01 0.941E-02 0.727E-51 ort(H,U,f) =-0.692E-02 0.244E-01-0.769E-51
DMP:  22    -0.189776336004E+02   -0.30246E-01   -0.16955E-01   540   0.488E-01 0.175E-01
 gam= 0.442 g(H,U,f)=  0.274E-01 0.721E-02 0.845E-41 ort(H,U,f) =-0.537E-02 0.177E-01 0.448E-41
DMP:  23    -0.189879653733E+02   -0.10332E-01   -0.12029E-01   540   0.346E-01 0.124E-01
 gam= 0.442 g(H,U,f)=  0.850E-02 0.315E-02 0.193E-26 ort(H,U,f) = 0.282E-02 0.100E-01 0.330E-26
DMP:  24    -0.189958781096E+02   -0.79127E-02   -0.51934E-02   540   0.117E-01 0.128E-01
 gam= 0.442 g(H,U,f)=  0.363E-02 0.864E-03 0.169E-15 ort(H,U,f) =-0.153E-02 0.395E-02 0.447E-15
DMP:  25    -0.189988296246E+02   -0.29515E-02   -0.16699E-02   540   0.450E-02 0.242E-02
 gam= 0.442 g(H,U,f)=  0.233E-02 0.197E-03 0.116-122 ort(H,U,f) =-0.174E-03 0.121E-02 0.311-122
DMP:  26    -0.189998282460E+02   -0.99862E-03   -0.89687E-03   540   0.253E-02 0.104E-02
 gam= 0.442 g(H,U,f)=  0.835E-03 0.634E-04 0.224-102 ort(H,U,f) = 0.461E-03 0.390E-03 0.518-102
DMP:  27    -0.190004161516E+02   -0.58791E-03   -0.38229E-03   540   0.899E-03 0.851E-03
 gam= 0.442 g(H,U,f)=  0.350E-03 0.327E-04 0.568E-86 ort(H,U,f) = 0.665E-04 0.162E-03 0.126E-85
DMP:  28    -0.190006475358E+02   -0.23138E-03   -0.14520E-03   540   0.383E-03 0.229E-03
 gam= 0.442 g(H,U,f)=  0.180E-03 0.173E-04 0.195E-72 ort(H,U,f) = 0.285E-04 0.735E-04 0.468E-72
DMP:  29    -0.190007350998E+02   -0.87564E-04   -0.72648E-04   540   0.197E-03 0.102E-03
 gam= 0.442 g(H,U,f)=  0.677E-04 0.712E-05 0.164E-61 ort(H,U,f) = 0.158E-04 0.294E-04 0.450E-61
DMP:  30    -0.190007754035E+02   -0.40304E-04   -0.28436E-04   540   0.749E-04 0.451E-04
 gam= 0.442 g(H,U,f)=  0.326E-04 0.289E-05 0.534E-53 ort(H,U,f) =-0.640E-06 0.961E-05 0.164E-52
DMP:  31    -0.190007893927E+02   -0.13989E-04   -0.11848E-04   540   0.355E-04 0.897E-05
 gam= 0.442 g(H,U,f)=  0.154E-04 0.174E-05 0.167E-46 ort(H,U,f) = 0.429E-05 0.360E-05 0.551E-46
DMP:  32    -0.190007968830E+02   -0.74903E-05   -0.61901E-05   540   0.172E-04 0.789E-05
 gam= 0.442 g(H,U,f)=  0.562E-05 0.110E-05 0.156E-41 ort(H,U,f) = 0.303E-05 0.223E-05 0.530E-41
DMP:  33    -0.190008018196E+02   -0.49366E-05   -0.27126E-05   540   0.672E-05 0.526E-05
 gam= 0.442 g(H,U,f)=  0.269E-05 0.582E-06 0.114E-37 ort(H,U,f) = 0.737E-06 0.150E-05 0.392E-37
DMP:  34    -0.190008039976E+02   -0.21780E-05   -0.12766E-05   540   0.327E-05 0.223E-05
 gam= 0.442 g(H,U,f)=  0.151E-05 0.277E-06 0.127E-34 ort(H,U,f) = 0.688E-06 0.865E-06 0.445E-34
DMP:  35    -0.190008047470E+02   -0.74938E-06   -0.74085E-06   540   0.178E-05 0.155E-05
 gam= 0.442 g(H,U,f)=  0.617E-06 0.129E-06 0.337E-32 ort(H,U,f) = 0.549E-06 0.451E-06 0.123E-31
DMP:  36    -0.190008049573E+02   -0.21028E-06   -0.35648E-06   540   0.746E-06 0.100E-05
 gam= 0.442 g(H,U,f)=  0.279E-06 0.611E-07 0.301E-30 ort(H,U,f) = 0.203E-06 0.225E-06 0.116E-29
DMP:  37    -0.190008049878E+02   -0.30526E-07   -0.15885E-06   540   0.340E-06 0.429E-06
 gam= 0.442 g(H,U,f)=  0.165E-06 0.288E-07 0.122E-28 ort(H,U,f) = 0.142E-06 0.110E-06 0.488E-28
DMP:  38    -0.190008050643E+02   -0.76510E-07   -0.91373E-07   540   0.193E-06 0.252E-06
 final diagonalization
 -----------------------------------------------------------------------------
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|     One or more components of EFIELD_PEAD are too large for comfort. In     |
|     all probability, you are too near to the onset of Zener tunneling.      |
|                                                                             |
|          e |E dot A_1| =  0.02779 > 1/10, E_g/N_1 =  0.01822                |
|          e |E dot A_2| =  0.00000 > 1/10, E_g/N_2 =  0.01822                |
|          e |E dot A_3| =  0.00000 > 1/10, E_g/N_3 =  0.01822                |
|                                                                             |
|     Possible SOLUTIONS:                                                     |
|      ) Choose a smaller electric field.                                     |
|      ) Use a less dense grid of k-points.                                   |
|                                                                             |
 -----------------------------------------------------------------------------

 -----------------------------------------------------------------------------
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|     The requested file  could not be found or opened for reading            |
|     k-point information. Automatic k-point generation is used as a          |
|     fallback, which may lead to unwanted results.                           |
|                                                                             |
 -----------------------------------------------------------------------------

       N       E                     dE             d eps       ncg     rms          ort
 gam= 0.000 g(H,U,f)=  0.221E-06 0.276E-07 0.643E-31 ort(H,U,f) = 0.000E+00 0.000E+00 0.000E+00
SDA:   1    -0.190008050969E+02   -0.10913E-06   -0.74493E-07  1080   0.248E-06 0.000E+00
 gam= 0.442 g(H,U,f)=  0.113E-06 0.131E-07 0.502E-22 ort(H,U,f) = 0.154E-06 0.187E-07 0.519E-22
DMP:   2    -0.190008043419E+02    0.75502E-06   -0.60557E-07  1080   0.126E-06 0.172E-06
 gam= 0.442 g(H,U,f)=  0.649E-07 0.408E-08 0.230E-21 ort(H,U,f) = 0.125E-06 0.112E-07 0.218E-21
DMP:   3    -0.190008045174E+02   -0.17551E-06   -0.38720E-07  1080   0.689E-07 0.136E-06
 gam= 0.442 g(H,U,f)=  0.349E-07 0.994E-09 0.307E-20 ort(H,U,f) = 0.887E-07 0.365E-08 0.281E-20
DMP:   4    -0.190008046617E+02   -0.14429E-06   -0.22996E-07  1080   0.359E-07 0.923E-07
 gam= 0.442 g(H,U,f)=  0.171E-07 0.248E-09 0.545E-19 ort(H,U,f) = 0.529E-07 0.608E-09 0.710E-19
DMP:   5    -0.190008047633E+02   -0.10163E-06   -0.12299E-07  1080   0.173E-07 0.535E-07
 gam= 0.442 g(H,U,f)=  0.923E-08 0.150E-09-0.161E-33 ort(H,U,f) = 0.300E-07-0.402E-10-0.356E-33
DMP:   6    -0.190008048326E+02   -0.69234E-07   -0.67828E-08  1080   0.938E-08 0.300E-07
 gam= 0.442 g(H,U,f)=  0.492E-08 0.124E-09 0.895-133 ort(H,U,f) = 0.171E-07 0.917E-11 0.371-132
DMP:   7    -0.190008048812E+02   -0.48655E-07   -0.37825E-08  1080   0.504E-08 0.171E-07
 gam= 0.442 g(H,U,f)=  0.280E-08 0.771E-10 0.204-126 ort(H,U,f) = 0.973E-08 0.762E-10 0.157-125
DMP:   8    -0.190008049161E+02   -0.34888E-07   -0.21613E-08  1080   0.288E-08 0.980E-08
 final diagonalization
 p_tot=( -0.778E+02 -0.778E+02 -0.778E+02 )
       N       E                     dE             d eps       ncg     rms          ort
 p_tot=( -0.778E+02 -0.778E+02 -0.778E+02 )
dp_tot=(  0.000E+00  0.000E+00  0.000E+00 )  diag[e(oo)]=(  1.00000    ---      ---   )
 -----------------------------------------------------------------------------
|                     _     ____    _    _    _____     _                     |
|                    | |   |  _ \  | |  | |  / ____|   | |                    |
|                    | |   | |_) | | |  | | | |  __    | |                    |
|                    |_|   |  _ <  | |  | | | | |_ |   |_|                    |
|                     _    | |_) | | |__| | | |__| |    _                     |
|                    (_)   |____/   \____/   \_____|   (_)                    |
|                                                                             |
|     internal error in: ./fft3dbatched.F  at line: 769                       |
|                                                                             |
|     FFTBAS_MU: no cuFFT plan found for ACC_ASYNC_Q= 1 and batch size N=     |
|     22                                                                      |
|                                                                             |
|     If you are not a developer, you should not encounter this problem.      |
|     Please submit a bug report.                                             |
|                                                                             |
 -----------------------------------------------------------------------------

What is this? I used the NVHPC 24.1 SDK.

Thank you.

Re: error during GPU calculations

Posted: Tue May 14, 2024 2:44 pm
by jonathan_lahnsteiner2
Dear sergey_lisenkov1,

Could you please supply your input files? Otherwise I will not be able to reproduce the problem.

Have you considered the warning messages you are getting? One message says that you did not supply a KPOINTS file. With no KPOINTS file supplied, the k-points are set up according to the KSPACING tag. I am not sure if this is what you intended.
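
For reference, here is a minimal sketch of how the mesh implied by KSPACING could be estimated. It assumes the rule given on the VASP wiki, N_i = max(1, ceil(2π|b_i| / KSPACING)), where b_i are the reciprocal lattice vectors without the factor 2π and KSPACING is in Å⁻¹. The function name `kspacing_mesh` and the 5.6 Å cubic cell are made up for illustration. The k-points VASP actually generated are written to IBZKPT and should be checked there.

```python
import math


def cross(u, v):
    """Cross product of two 3-vectors."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def dot(u, v):
    """Dot product of two 3-vectors."""
    return sum(a * b for a, b in zip(u, v))


def kspacing_mesh(lattice, kspacing=0.5):
    """Estimate the subdivisions (N1, N2, N3) implied by KSPACING.

    lattice  : three lattice vectors in Angstrom (rows a1, a2, a3)
    kspacing : KSPACING in 1/Angstrom (0.5 is the documented default)

    Assumption: N_i = max(1, ceil(2*pi*|b_i| / KSPACING)), with b_i the
    reciprocal lattice vectors defined WITHOUT the factor 2*pi.
    """
    a1, a2, a3 = lattice
    volume = dot(a1, cross(a2, a3))
    recip = [
        tuple(c / volume for c in cross(a2, a3)),
        tuple(c / volume for c in cross(a3, a1)),
        tuple(c / volume for c in cross(a1, a2)),
    ]
    return tuple(
        max(1, math.ceil(2.0 * math.pi * math.sqrt(dot(b, b)) / kspacing))
        for b in recip
    )


# Hypothetical cubic cell, not taken from the attached inputs.
cell = [(5.6, 0.0, 0.0), (0.0, 5.6, 0.0), (0.0, 0.0, 5.6)]
print(kspacing_mesh(cell, kspacing=0.3))
```

Comparing this estimate with the mesh in an explicit KPOINTS file is a quick way to see whether the fallback gives the sampling you intended.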

Another warning message that you get tells you EFIELD_PEAD is too large. You might want to consider reducing this variable.

All the best Jonathan

Re: error during GPU calculations

Posted: Tue May 14, 2024 3:47 pm
by sergey_lisenkov1
Hi Jonathan,

Thanks for the reply. I was using the KSPACING setting instead of a KPOINTS file.

Regarding the EFIELD_PEAD warning: usually (on CPU) VASP continues with the job. I'll reduce the value and, in the meantime, send you the files.

Re: error during GPU calculations

Posted: Tue May 14, 2024 3:57 pm
by sergey_lisenkov1
Input/output files for the test system are attached. I reduced the EFIELD_PEAD values, so the warning is gone, but the calculation crashed with the same message.

Sergey

Re: error during GPU calculations

Posted: Thu May 16, 2024 11:12 am
by jonathan_lahnsteiner2
Dear Sergey,

I am able to reproduce the bug you reported. We will resolve this issue as soon as possible.
I will let you know when there is a solution available.

All the best

Jonathan

Re: error during GPU calculations

Posted: Tue May 28, 2024 9:04 am
by francesco_ricci
Hello there.

I just wanted to mention that I also ran into the same error message while running a LCALCEPS=True calculation
with the HSE functional using vasp-6.4.1 or 6.4.2.
I was able to reproduce the error even with a simple input for bulk Si, with no other warnings during the run.
The same calculation works perfectly on CPUs.

If it helps, I can provide the input and output files and further details on the libraries used to compile VASP.

Thanks!
FR

Re: error during GPU calculations

Posted: Tue May 28, 2024 3:26 pm
by jonathan_lahnsteiner2
Dear Francesco Ricci,

Thank you for letting us know. We are currently working on a bug fix for this issue.
It would be very helpful if you could upload the input/output files that you were using
and let us know which toolchains you were using.

With many thanks and all the best Jonathan

Re: error during GPU calculations

Posted: Mon Jun 03, 2024 11:37 am
by francesco_ricci
Sure, here is the input for Si. The makefile is also attached, if it helps.
hse_si_gpu_bug_FFTBAS_MU.tar.gz
I also tried KPAR=1 and got the same error.

I'm running on the NERSC Perlmutter supercomputer with these modules loaded:
Currently Loaded Modules:
 1) craype-x86-milan
 2) libfabric/1.15.2.0
 3) craype-network-ofi
 4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta
 5) perftools-base/23.12.0
 6) cpe/23.12
 7) conda/Miniconda3-py311_23.11.0-2
 8) evp-patch
 9) python/3.11 (dev)
10) nvidia/23.9 (g,c)
11) craype/2.7.30 (c)
12) cray-dsmml/0.2.2
13) cray-mpich/8.1.28 (mpi)
14) cray-libsci/23.12.5 (math)
15) PrgEnv-nvidia/8.4.0 (cpe)
16) cray-fftw/3.3.10.6 (math)
17) cray-hdf5/1.12.2.3 (io)
18) wannier90/3.1.0
19) cudatoolkit/12.2 (g)
20) craype-accel-nvidia80
21) gpu/1.0
22) vasp/6.4.2-gpu

Where:
g: built for GPU
mpi: MPI Providers
cpe: Cray Programming Environment Modules
math: Mathematical libraries
io: Input/output software
c: Compiler
dev: Development Tools and Programming Languages


Thanks!

Re: error during GPU calculations

Posted: Wed Jun 05, 2024 1:42 pm
by jonathan_lahnsteiner2
Dear Sergey, Dear Francesco,

I have investigated the issue you reported, and there will be a bug fix for it in the next release. The problem occurs when LCALCEPS is used in combination with hybrid functionals on a GPU. The error arises because the number of wavefunctions being parallelized exceeds the actual number of occupied states. This is what the bug message points to when it reports the batch size: the batch size cannot be larger than the number of occupied states. In the meantime, you can work around the problem by setting:

NBLOCK_FOCK <= number of occupied states
With this setting, your calculations should run as expected. We tested both of the examples you provided, and both completed without any issues.
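
As a rough illustration of how one might pick a value that satisfies this condition, here is a minimal sketch. It rests on several assumptions that are not part of the thread: that the OUTCAR contains a line of the form `NELECT = ... total number of electrons`, and that the system is a non-spin-polarized insulator with two electrons per occupied band (with spin polarization or spin-orbit coupling the count differs). The helper names `parse_nelect`, `occupied_bands` and `nblock_fock_line`, as well as the sample line, are hypothetical.

```python
import math
import re

# Assumed OUTCAR format: "   NELECT =       8.0000    total number of electrons"
NELECT_RE = re.compile(r"NELECT\s*=\s*([0-9.]+)")


def parse_nelect(outcar_text):
    """Return NELECT from OUTCAR text, or raise if no such line exists."""
    match = NELECT_RE.search(outcar_text)
    if match is None:
        raise ValueError("no NELECT line found")
    return float(match.group(1))


def occupied_bands(nelect):
    """Occupied bands, ASSUMING a non-spin-polarized insulator
    (two electrons per band)."""
    return math.ceil(nelect / 2.0)


def nblock_fock_line(nelect, requested=None):
    """Build an INCAR line with NBLOCK_FOCK capped at the occupied-band count.

    requested : value you would like to use; it is reduced if it exceeds
                the number of occupied bands. If None, the cap itself is used.
    """
    n_occ = occupied_bands(nelect)
    value = n_occ if requested is None else min(requested, n_occ)
    return f"NBLOCK_FOCK = {max(value, 1)}"


# Illustrative line only; it is not copied from the attached output files.
sample = "   NELECT =       8.0000    total number of electrons\n"
print(nblock_fock_line(parse_nelect(sample)))
```

The resulting line would go into the INCAR. Any value at or below the occupied-band count should satisfy the condition above; which value performs best on a given GPU is not something this sketch tries to answer.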

All the best Jonathan

Re: error during GPU calculations

Posted: Wed Jun 05, 2024 2:34 pm
by sergey_lisenkov1
Thanks a lot!

Re: error during GPU calculations

Posted: Mon Jun 10, 2024 9:04 am
by francesco_ricci
Amazing, thanks a lot for looking into this and sharing a temporary solution!