very slow parallel version on EM64t cluster

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.

igerber
Newbie
Posts: 2
Joined: Wed Apr 26, 2006 9:28 am

very slow parallel version on EM64t cluster

#1 Post by igerber » Wed Apr 26, 2006 10:08 am

Hi everyone,

I'm currently trying to compile vasp.4.6 on EM64T Xeon machines.
I've succeeded with the serial version, using the Intel 9.0 compiler (fce) in 64-bit mode.
However, I've run into trouble with the parallel version.

Here is my makefile:

.SUFFIXES: .inc .f .f90 .F
#-----------------------------------------------------------------------
# Makefile for Intel Fortran compiler for P4 systems
#
# The makefile was tested only under Linux on Intel platforms
# (Suse 5.3, libc 6 <-> glibc 2.X)
#
# it might be required to change some of the library paths, since
# LINUX installations vary a lot
# Hence check ***ALL**** options in this makefile very carefully
#-----------------------------------------------------------------------
#
# Mind that some Linux distributions (Suse 6.1) have a bug in
# libm causing small errors in the error-function (total energy
# is therefore wrong by about 1meV/atom). The recommended
# solution is to update libc.
#
# BLAS must be installed on the machine
# there are several options:
# 1) very slow but works:
# retrieve the lapackage from ftp.netlib.org
# and compile the blas routines (BLAS/SRC directory)
# please use g77 or f77 for the compilation. When I tried to
# use pgf77 or pgf90 for BLAS, VASP hung when calling
# ZHEEV (however this was with lapack 1.1; now I use lapack 2.0)
# 2) most desirable: get an optimized BLAS
# for a list of optimized BLAS try
# http://www.kachinatech.com/~hjjou/scilib/opt_blas.html
#
# the two most reliable packages around are presently:
# 3a) Intels own optimised BLAS (PIII, P4, Itanium)
# http://developer.intel.com/software/products/mkl/
# this is really excellent when you use Intel CPUs
#
# 3b) or obtain the atlas based BLAS routines
# http://math-atlas.sourceforge.net/
# you certainly need atlas on the Athlon, since the mkl
# routines are not optimal on the Athlon.
# If you want to use atlas based BLAS, check the lines around LIB=
#
# 3c) brand new and mindblowing fast SSE (4 GFlops on P4, 2.53 GHz)
# Kazushige Goto's BLAS
# http://www.cs.utexas.edu/users/kgoto/signup_first.html
#
#
#-----------------------------------------------------------------------

# all CPP processed fortran files have the extension .f90
SUFFIX=.f90

#-----------------------------------------------------------------------
# fortran compiler and linker
#-----------------------------------------------------------------------
FC=/opt/intel/fce/9.0/bin/ifort
# fortran linker
FCL=$(FC)


#-----------------------------------------------------------------------
# whereis CPP ?? (I need CPP, can't use gcc with proper options)
# that's the location of gcc for SUSE 5.3
#
# CPP_ = /usr/lib/gcc-lib/i486-linux/2.7.2/cpp -P -C
#
# that's probably the right line for some Red Hat distribution:
#
# CPP_ = /usr/lib/gcc-lib/i386-redhat-linux/2.7.2.3/cpp -P -C
#
# SUSE X.X, maybe some Red Hat distributions:

CPP_ = ./preprocess <$*.F | /usr/bin/cpp -P -C -traditional -m64 >$*$(SUFFIX)

#-----------------------------------------------------------------------
# possible options for CPP:
# NGXhalf charge density reduced in X direction
# wNGXhalf gamma point only reduced in X direction
# avoidalloc avoid ALLOCATE if possible
# IFC work around some IFC bugs
# CACHE_SIZE 1000 for PII,PIII, 5000 for Athlon, 8000-12000 P4
# RPROMU_DGEMV use DGEMV instead of DGEMM in RPRO (usually faster)
# RACCMU_DGEMV use DGEMV instead of DGEMM in RACC (faster on P4)
#-----------------------------------------------------------------------

CPP = $(CPP_) -DHOST=\"LinuxIFC\" \
-Dkind8 -DNGXhalf -DCACHE_SIZE=8000 -DPGF90 -Davoidalloc \
-DRPROMU_DGEMV -DRACCMU_DGEMV -DNBLK_default=64
# -DNBLK_default=64

#-----------------------------------------------------------------------
# general fortran flags (there must be a trailing blank on this line)
#-----------------------------------------------------------------------

FFLAGS = -FR -lowercase -quiet

#-----------------------------------------------------------------------
# optimization
# we have tested whether higher optimisation improves performance
# -xW SSE2 optimization
# -axW SSE2 optimization, but also generate code executable on all mach.
# -tpp7 P4 optimization
# -prefetch
#-----------------------------------------------------------------------

OFLAG=-O3 -unroll0 -tpp7

OFLAG_HIGH = $(OFLAG)
OBJ_HIGH =
OBJ_NOOPT =
DEBUG = -FR -O0
INLINE = $(OFLAG)


#-----------------------------------------------------------------------
# the following lines specify the position of BLAS and LAPACK
# on P4, VASP works fastest with Intels mkl performance library
# so that's what I recommend
#-----------------------------------------------------------------------

# Atlas based libraries
#ATLASHOME= $(HOME)/archives/BLAS_OPT/ATLAS/lib/Linux_P4SSE2/
#BLAS= -L$(ATLASHOME) -lf77blas -latlas

# use specific libraries (default library path points to other libraries)
#BLAS= $(ATLASHOME)/libf77blas.a $(ATLASHOME)/libatlas.a

# use the mkl Intel libraries for p4 (www.intel.com)
# mkl.5.1
# set -DRPROMU_DGEMV -DRACCMU_DGEMV in the CPP lines
#BLAS=-L/opt/intel/mkl/lib/32 -lmkl_p4 -lpthread

# mkl.5.2 requires also to -lguide library
# set -DRPROMU_DGEMV -DRACCMU_DGEMV in the CPP lines
#BLAS=-L/opt/intel/mkl/lib/32 -lmkl_p4 -lguide -lpthread

# even faster Kazushige Goto's BLAS
# http://www.cs.utexas.edu/users/kgoto/signup_first.html
BLAS= /home/vasp/GotoBLAS/libgoto_prescott-r1.00.so

# LAPACK, simplest use vasp.4.lib/lapack_double
LAPACK= ../vasp.4.lib/lapack_double.o

# use atlas optimized part of lapack
#LAPACK= ../vasp.4.lib/lapack_atlas.o -llapack -lcblas

# use the mkl Intel lapack
#LAPACK= -lmkl_lapack

#-----------------------------------------------------------------------

LIB = -L../vasp.4.lib -ldmy \
../vasp.4.lib/linpack_double.o $(LAPACK) \
$(BLAS)

# options for linking (for compiler version 6.X) nothing is required
LINK =
# compiler version 7.0 generates some vector statements which are located
# in the svml library, add the LIBPATH and the library (just in case)
#LINK = -L/opt/intel/compiler70/ia32/lib/ -lsvml

#-----------------------------------------------------------------------
# fft libraries:
# VASP.4.6 can use fftw.30 (http://www.fftw.org)
# since this version is faster on P4 machines, we recommend using it
#-----------------------------------------------------------------------

FFT3D = fft3dfurth.o fft3dlib.o
#FFT3D = fftw3d.o fft3dlib.o /home/vasp/fftw-3.0.1/lib/libfftw3.a

#=======================================================================
# MPI section, uncomment the following lines
#
# one comment for users of mpich or lam:
# You must *not* compile mpi with g77/f77, because f77/g77
# appends *two* underscores to symbols that contain already an
# underscore (i.e. MPI_SEND becomes mpi_send__). The pgf90
# compiler however appends only one underscore.
# Precompiled mpi version will also not work !!!
#
# We found that mpich.1.2.1 and lam-6.5.X are stable
# mpich.1.2.1 was configured with
# ./configure -prefix=/usr/local/mpich_nodvdbg -fc="pgf77 -Mx,119,0x200000" \
# -f90="pgf90 -Mx,119,0x200000" \
# --without-romio --without-mpe -opt=-O \
#
# lam was configured with the line
# ./configure -prefix /usr/local/lam-6.5.X --with-cflags=-O -with-fc=pgf90 \
# --with-f77flags=-O --without-romio
#
# lam was generally faster and we found an average communication
# bandwidth of roughly 160 MBit/s (full duplex)
#
# please note that you might be able to use a lam or mpich version
# compiled with f77/g77, but then you need to add the following
# options: -Msecond_underscore (compilation) and -g77libs (linking)
#
# !!! Please do not send me any queries on how to install MPI, I will
# certainly not answer them !!!!
#=======================================================================
#-----------------------------------------------------------------------
# fortran linker for mpi: if you use LAM and compiled it with the options
# suggested above, you can use the following line
#-----------------------------------------------------------------------
FC=/usr/local/lam-intel/bin/mpif77
FCL=$(FC)

#-----------------------------------------------------------------------
# additional options for CPP in parallel version (see also above):
# NGZhalf charge density reduced in Z direction
# wNGZhalf gamma point only reduced in Z direction
# scaLAPACK use scaLAPACK (usually slower on 100 Mbit Net)
#-----------------------------------------------------------------------

CPP = $(CPP_) -DMPI -DHOST=\"LinuxIFC\" -DIFC \
-Dkind8 -DCACHE_SIZE=8000 -DPGF90 -Davoidalloc \
-DRPROMU_DGEMV -DRACCMU_DGEMV
# -DNBLK_default=64 -DRPROMU_DGEMV -DRACCMU_DGEMV
# -DPROC_GROUP=8 \

#-----------------------------------------------------------------------
# location of SCALAPACK
# if you do not use SCALAPACK simply uncomment the line SCA
#-----------------------------------------------------------------------

BLACS=$(HOME)/archives/SCALAPACK/BLACS/
SCA_=$(HOME)/archives/SCALAPACK/SCALAPACK

SCA= $(SCA_)/libscalapack.a \
$(BLACS)/LIB/blacsF77init_MPI-LINUX-0.a $(BLACS)/LIB/blacs_MPI-LINUX-0.a $(BLACS)/LIB/blacsF77init_MPI-LINUX-0.a

SCA=

#-----------------------------------------------------------------------
# libraries for mpi
#-----------------------------------------------------------------------

LIB = -L../vasp.4.lib -ldmy \
../vasp.4.lib/linpack_double.o $(LAPACK) \
$(SCA) $(BLAS)

# fftw.3.0 is slightly faster and should be used if available
FFT3D = fftmpi.o fftmpi_map.o fft3dlib.o
#FFT3D = fftmpi.o fftmpi_map.o fft3dlib.o /home/vasp/fftw-3.0.1/lib/libfftw3.a
#-----------------------------------------------------------------------
# general rules and compile lines
#-----------------------------------------------------------------------
....

The big problem is that as soon as I use more than one CPU,
the elapsed time of the Cu benchmark just explodes.
For instance:

With one CPU I get:

General timing and accounting informations for this job:
========================================================

Total CPU time used (sec): 9.896
User time (sec): 9.608
System time (sec): 0.289
Elapsed time (sec): 12.375

Maximum memory used (kb): 0.
Average memory used (kb): 0.

Minor page faults: 61028
Major page faults: 0
Voluntary context switches: 641

While with 2 CPUs on the same node, or on different nodes, I obtain:

General timing and accounting informations for this job:
========================================================

Total CPU time used (sec): 7.490
User time (sec): 6.674
System time (sec): 0.816
Elapsed time (sec): 74.036

Maximum memory used (kb): 0.
Average memory used (kb): 0.

Minor page faults: 33496
Major page faults: 0
Voluntary context switches: 68884

How can such a difference be explained?
How can I investigate the source of such a drama?
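
One way to narrow this down might be to take VASP out of the picture and time raw point-to-point communication over the same LAM installation. Below is a minimal MPI ping-pong sketch (not part of VASP; the file name, message size and loop count are arbitrary illustrative choices). It could be built with the same wrapper as in the makefile, e.g. /usr/local/lam-intel/bin/mpif77 -FR pingpong.f90, assuming that wrapper really invokes ifort:

! pingpong.f90 -- minimal MPI ping-pong between ranks 0 and 1 (diagnostic sketch only)
program pingpong
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 131072        ! message size: 1 MiB of double precision data
  integer, parameter :: nloop = 100       ! number of round trips to average over
  double precision   :: buf(n), t0, t1
  integer            :: ierr, rank, nprocs, i, status(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  if (nprocs < 2) then
     if (rank == 0) print *, 'run with at least 2 processes'
     call MPI_FINALIZE(ierr)
     stop
  end if

  buf = 1.0d0
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  t0 = MPI_WTIME()
  do i = 1, nloop
     if (rank == 0) then
        ! rank 0 sends the buffer to rank 1 and waits for it to come back
        call MPI_SEND(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
        call MPI_RECV(buf, n, MPI_DOUBLE_PRECISION, 1, 1, MPI_COMM_WORLD, status, ierr)
     else if (rank == 1) then
        ! rank 1 echoes the buffer straight back
        call MPI_RECV(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
        call MPI_SEND(buf, n, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, ierr)
     end if
  end do
  t1 = MPI_WTIME()

  if (rank == 0) then
     print '(A,F10.3,A)', ' average round trip : ', (t1 - t0) / nloop * 1.0d3, ' ms'
     print '(A,F10.1,A)', ' effective bandwidth: ', &
           2.0d0 * 8.0d0 * n * nloop / (t1 - t0) / 1.0d6, ' MB/s'
  end if
  call MPI_FINALIZE(ierr)
end program pingpong

If this already shows round-trip times of many milliseconds, or a bandwidth far below what the network should deliver, the trouble is likely in the MPI/interconnect layer (or in which compiler the LAM wrappers actually call) rather than in VASP itself.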

Thanks in advance...

Iann

tjf
Full Member
Posts: 107
Joined: Wed Aug 10, 2005 1:30 pm
Location: Leiden, Netherlands

very slow parallel version on EM64t cluster

#2 Post by tjf » Wed Apr 26, 2006 11:53 am

Are there actually two real CPUs? (I'm looking at the context switch numbers and trying crazy ideas first!)
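
(For what it's worth, one quick way to check this from a program, as a throwaway sketch that is in no way VASP-specific: count the processor entries the kernel lists in /proc/cpuinfo. Hyper-threaded logical CPUs show up there as well, so the physical id / siblings fields, when the kernel prints them, are what distinguish two real CPUs from one hyper-threaded one.)

! countcpu.f90 -- count "processor" entries in /proc/cpuinfo and echo the
! "physical id" / "siblings" lines when present
program countcpu
  implicit none
  character(len=256) :: line
  integer :: ios, ncpu

  ncpu = 0
  open(10, file='/proc/cpuinfo', status='old', action='read', iostat=ios)
  if (ios /= 0) stop 'cannot open /proc/cpuinfo'
  do
     read(10, '(A)', iostat=ios) line
     if (ios /= 0) exit
     ! each logical CPU contributes one line starting with "processor"
     if (index(line, 'processor') == 1) ncpu = ncpu + 1
     if (index(line, 'physical id') == 1 .or. index(line, 'siblings') == 1) &
        print '(A)', trim(line)
  end do
  close(10)
  print '(A,I4)', ' logical CPUs reported: ', ncpu
end program countcpu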

igerber
Newbie
Posts: 2
Joined: Wed Apr 26, 2006 9:28 am

very slow parallel version on EM64t cluster

#3 Post by igerber » Wed Apr 26, 2006 12:59 pm

Yes, there are...
And almost the same timing is obtained when I use two CPUs
from different nodes...
