Simple Test of an OpenMP Parallelization Directive on a Loop of Independent Calculations Inside a Subroutine
In the past, major speedups in the MASS model code were made possible by computers that perform long vector calculations, such as various Cray supercomputers and the Stardent workstations. The MASS code was quite amenable to this approach because of the prevalence of long "horizontal" loops with simple calculations inside them, such as the following:
      DO 100 J=1,NY
      DO 100 I=1,NX
         T(I,J) = T(I,J) + 273.16
  100 CONTINUE
We often would reconfigure (I,J) arrays into one-dimensional arrays to further increase the efficiency gained by vectorization.
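The reconfiguration described above can be sketched as follows. This is a minimal illustration rather than MASS code: the program name flat2d, the subroutine addk, and the small array dimensions are hypothetical. It relies on Fortran storing arrays contiguously in column-major order, so a subroutine may treat a two-dimensional array as a single vector of length NX*NY and process it in one long loop.

```fortran
      program flat2d
      implicit none
      integer nx, ny
      parameter (nx=4, ny=3)
      real t(nx,ny)
      integer i, j
! Fill the 2-D array with placeholder values (hypothetical data).
      do j = 1, ny
         do i = 1, nx
            t(i,j) = real(i + j)
         end do
      end do
! Pass the 2-D array; the callee sees it as one vector of nx*ny.
      call addk(t, nx*ny)
      print *, t(1,1), t(nx,ny)
      end

      subroutine addk(t1d, n)
      implicit none
      integer n, ij
      real t1d(n)
! One long loop replaces the nested I,J loops.
      do ij = 1, n
         t1d(ij) = t1d(ij) + 273.16
      end do
      end
```

The single loop of length NX*NY gives a vectorizing compiler one long vector to work on instead of NY shorter ones.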
Two things have changed in recent years. First, the workstations we have mostly used have gained a great deal of scalar speed through faster clock speeds and pipelining, while vectorization has remained mostly limited to supercomputers. Second, parts of the MASS code have become less vectorizable. The major difference is that the SRPH (surface energy budget, radiation, planetary boundary layer, hydrology) scheme is called for a single vertical column at a time. SRPH takes up a significant portion of the total MASS execution time (as much as 60%, depending on the configuration), and other portions of the code, such as the cumulus parameterization, are similarly structured. A simplified example of such a code structure is shown below:
      DO 100 J=1,NY
      DO 100 I=1,NX
         CALL SUB1D(T(I,J),. . .)
  100 CONTINUE
These kinds of loops do not vectorize at all, and the automatic parallelization feature available with some compilers will not do anything with them either. They therefore represent a significant bottleneck that may hinder code efficiency improvements. Since it is a goal of the SECA project to speed up the MASS code, an effort is underway to determine whether we can find a way to parallelize these kinds of loops. Theoretically, they should be good candidates for parallelization, because the vertical columns are generally independent of each other while the PBL and other calculations are being carried out. This may not be true, however, for the three-dimensional version of the TKE PBL scheme, which is used in the version of MASS used for the SECA HCRM work.
Since this kind of code structure will not parallelize automatically, it is necessary to parallelize it explicitly. There are at least two possible approaches. First, a package such as PVM (Parallel Virtual Machine) may be used: PVM calls inserted into the code allow selected tasks to be sent to different processors on the same machine, or even to other networked workstations that are also running PVM. PVM is public domain and widely used, but it is generally known to be fairly difficult to use. Other packages similar to PVM, such as MPI (Message Passing Interface), might also be employed.
Second, a group of computer companies has defined the OpenMP Application Program Interface, which is described on its web page (http://www.openmp.org) as:
The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.
A key advantage of OpenMP over something like PVM is that it allows parallelization at a higher level of abstraction, so that a relatively simple set of parallelization directives inserted into a code compiled with an OpenMP-compliant compiler may yield significant benefits with minimal code changes.
A Linux/dual-Pentium workstation was purchased to test model performance for the SECA project. The system used was sirocco.meso.ncsu.edu at MESO's office in Raleigh, North Carolina. sirocco has two 350 MHz Pentium II processors, with Windows 98 installed on one partition and Red Hat Linux 5.2 on another. A Fortran 77 compiler (pgf77) from the Portland Group, Inc. was used. The compiler documentation says:
The PGI shared-memory parallel programming model is a substantial subset of the OpenMP Fortran Application Program Interface. PGI will supply a fully compliant implementation of this standard in a future release.
A simple Fortran program was written to test the ability of an OpenMP directive to efficiently parallelize the case of a subroutine call within a loop similar to the kind used in MASS. The key part is the following loop:
      do n=1,10000
        do j=1,ny
!$OMP DOACROSS PRIVATE (array1)
          do i=1,nx
            do k=1,nz
              array1(k) = array3(i,j,k)
            end do
            call column (array1,nzmax,nz,
     &                   float(i),float(j))
            do k=1,nz
              array3(i,j,k) = array1(k)
            end do
          end do
        end do
      end do
The outer n loop represents iterations as the model marches in time. The i and j loops represent a typical model loop across both horizontal directions of the model domain. Within these loops, a vertical profile from a three-dimensional variable array3 is extracted into a one-dimensional array array1, which is passed to a subroutine that makes calculations only in the vertical direction. In this test code, the subroutine column just makes some simple calculations that can be checked for accuracy, then some more mathematical calculations just to consume processing time. The calculations in each vertical column are independent of the columns around it at a given time step, so there is no reason that multiple columns could not be handled in parallel by different processors or different machines.
The DOACROSS line is a one-line OpenMP directive which tells the compiler to parallelize the loop immediately following the directive, here the i loop. "PRIVATE (array1)" tells the compiler to set aside a separate copy of the array1 array for each thread executing the loop. Without it, both processors would read and write the same array1 storage simultaneously, resulting in interference and incorrect answers.
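For comparison, the OpenMP Fortran specification spells the combined parallel loop directive PARALLEL DO; the DOACROSS spelling shown in the test code is presumably the form accepted by the pgf77 release used here. The following is a minimal, self-contained sketch of the same pattern in the standard spelling. The program name coltest, the array sizes, and the body of column are hypothetical stand-ins, and the time-step loop is omitted for brevity.

```fortran
      program coltest
      implicit none
      integer nx, ny, nz
      parameter (nx=8, ny=6, nz=5)
      real array3(nx,ny,nz), array1(nz)
      integer i, j, k
! Hypothetical initial data: each level k holds the value k.
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
               array3(i,j,k) = real(k)
            end do
         end do
      end do
      do j = 1, ny
! Each thread gets its own array1 and k; i is private by default.
!$OMP PARALLEL DO PRIVATE(array1, k)
         do i = 1, nx
            do k = 1, nz
               array1(k) = array3(i,j,k)
            end do
            call column(array1, nz, real(i), real(j))
            do k = 1, nz
               array3(i,j,k) = array1(k)
            end do
         end do
!$OMP END PARALLEL DO
      end do
! Simple accuracy check against the known answer for one point.
      if (abs(array3(nx,ny,nz) - 73.0) .gt. 1.0e-4) stop 1
      print *, array3(1,1,1), array3(nx,ny,nz)
      end

      subroutine column(a, nz, x, y)
      implicit none
      integer nz, k
      real a(nz), x, y
! Stand-in for the column physics: depends only on this column.
      do k = 1, nz
         a(k) = a(k) + x + 10.0*y
      end do
      end
```

If the compiler's OpenMP option is not enabled, the !$OMP lines are treated as ordinary comments and the loop runs serially with the same results, which makes it easy to compare the serial and parallel answers.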
Figure 1 shows the performance of a set of executions of the parallel.f program. The method of comparison is to express the speed of the parallel runs relative to the speed of the single-processor run. The single-processor run (no optimization by the compiler) takes 158 s. The use of the Portland Group compiler option -Mconcur (auto concurrentization of loops) does not speed up the run because none of the loops parallelize automatically; in fact it is very slightly slower than the single-processor run. When the code is compiled with the -mp option and the DOACROSS OpenMP directive, it runs 1.92 times faster on the two-processor system, almost a perfect doubling. So the directive is very effective at partitioning the i loop iterations and subroutine column calls among the available processors, and the calculations are correct. It seems likely that this type of parallelization would also scale up to more processors very easily.
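The reported figures can be worked through as follows. The first two lines follow directly from the measured 158 s and the 1.92 speedup. The remaining lines are only a rough extrapolation that assumes Amdahl's law applies, with p the fraction of the run time that parallelizes; this model is an assumption, not something measured here, and it ignores effects such as memory contention and thread start-up cost.

```latex
\begin{align*}
T_2 &= \frac{T_1}{S_2} = \frac{158\ \mathrm{s}}{1.92} \approx 82.3\ \mathrm{s} \\
E_2 &= \frac{S_2}{2} = \frac{1.92}{2} = 0.96 \\
S_N &= \frac{1}{(1-p) + p/N}
  \quad\Longrightarrow\quad
  p = \frac{N}{N-1}\left(1 - \frac{1}{S_N}\right)
    = 2\left(1 - \frac{1}{1.92}\right) \approx 0.958 \\
S_4 &\approx \frac{1}{0.042 + 0.958/4} \approx 3.6 \\
S_8 &\approx \frac{1}{0.042 + 0.958/8} \approx 6.2
\end{align*}
```

Under these assumptions the test code would be expected to keep scaling well on four or eight processors, although with gradually falling efficiency.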

Fig. 1. Relative speed of parallel.f test code with various compilations.
Ken Waight MESO, Inc. 20 January, 1999