|
|
|
The authors present an application of multivariate non-hierarchical statistical clustering to geographic environmental data from the 48 conterminous United States in order to produce maps of regions of ecological similarity, called ecoregions. These maps represent finer scale regionalizations than those generated by the traditional technique: an expert with a marker pen. Nine input variables thought to affect the growth of vegetation are clustered at a resolution of one square kilometer. These data represent over 7.8 million map cells in a 9-dimensional data space. After developing a serial version of the iterative statistical clustering algorithm, the authors developed a parallel version of the algorithm which uses the MPI message passing routines. The parallel algorithm uses a classical self-scheduling single program multiple data (SPMD) organization, performs dynamic load balancing for reasonable performance in heterogeneous metacomputing environments, and provides fault tolerance by saving intermediate results for easy restarts in case of hardware failure. The parallel algorithm was tested on various heterogeneous metacomputing configurations involving three Intel Paragons, an IBM SP, and an SGI Origin 2000. The tests were performed without any code modification, and were made possible by Globus and the Globus-enabled version of MPI (MPICH-G). Preliminary tests indicate that while the algorithm works reasonably well under the metacomputing environment for a moderate number of CPUs, the communication overhead due to Globus can be prohibitive for large CPU configurations. Some remedies for reducing this overhead are discussed.
Statistical clustering is the division, arrangement, or classification of a number of (non-identical) objects into subgroups or categories based on their similarity. Hierarchical clustering provides a series of divisions, based on some measure of similarity, into all possible numbers of groups, from one single group which contains all objects to as many groups as there are objects, with each object being in a group by itself. Hierarchical clustering, which results in a complete similarity tree, is computer-intensive; therefore, the assemblage to be classified is limited to relatively few objects.
Non-hierarchical clustering provides only a single, user-specified level of division into groups; however, it can be used to classify a much larger number of objects. Multivariate geographic clustering employs non-hierarchical clustering on the individual pixels in a digital map from a Geographic Information System (GIS) for the purpose of classifying the cells into types or categories. The classification of satellite imagery into land cover or vegetation classes using spectral characteristics of each cell from multiple images taken at different wavelengths is a common example of multivariate geographic clustering. Rarely, however, is non-hierarchical clustering performed on map cell characteristics aside from spectral reflectance values.
Maps showing the suitability or characterization of regions are used for many purposes, including identifying appropriate ranges for particular plant and animal species, identifying suitable crops for an area or identifying a suitable area for a given crop, and identifying Plant Hardiness Zones for gardeners. In addition, ecologists have long used the concept of the ecoregion, an area within which there are similar ecological conditions, as a tool for understanding large geographic areas (Bailey, 1983, 1994, 1995, 1996; Omernick, 1986). Such regionalization maps, however, are usually prepared by individual experts in a rather subjective way, and are essentially objectifications of expert opinion. Our goal was to make repeatable the process of map regionalization based not on spectral cell characteristics, but on characteristics identified as important to the growth of woody vegetation. By using non-hierarchical multivariate geographic clustering we intended to produce several maps of ecoregions across the entire nation at a resolution of one kilometer per cell. At this resolution, the 48 conterminous United States contains over 7.8 million map cells. Nine characteristics from three categories--elevation, edaphic (or soil) factors, and climatic factors--were identified as important. The edaphic factors are 1) plant-available water capacity, 2) soil organic matter, 3) total Kjeldahl soil nitrogen, and 4) depth to seasonally-high water table. The climatic factors are 1) mean precipitation during the growing season, 2) mean solar insolation during the growing season, 3) degree-day heat sum during the growing season, and 4) degree-day cold sum during the non-growing season. The growing season is defined by the frost-free period between mean day of first and last frost each year. A map for each of these characteristics was generated from best-available data at a 1 sq km resolution for input into the clustering process (Hargrove and Luxmoore, 1998). Given the size of this input data and the significant amount of computer time typically required to perform statistical clustering, we decided a parallel computer was needed for this task.
The Globus metacomputing tool kit (Foster and Kesselman, 1997) (see www.globus.org) and the Globus-enabled version of MPICH (Gropp, et.al., 1996), called MPICH-G (see www-unix.mcs.anl.gov/mpi/mpich), is used in this implementation. MPICH-G provides a transparent interface to run MPI based applications under a heterogeneous metacomputing environment. Globus and MPICH-G were first installed and tested on the Intel Paragons at Oak Ridge National Laboratory (ORNL). Globus was already available on the National Center for Supercomputing Application (NCSA) Silicon Graphics (SGI) Origin 2000 and the Argonne National Laboratory (ANL) IBM SP, so no new installation of Globus was required for these machines. However, MPICH-G had to be installed on these machines.
Considerable work was expended in getting Globus and MPICH-G working on the Paragons because it was the first successful Globus installation on an Intel Paragon. The communication component of Globus is called Nexus. Nexus can use the native communication library within a parallel architecture (eg. MPI, MPL, NX) and TCP/IP or UDP for communication across different machines. The Paragon's communication library, NX, was not fully implemented within Nexus with the original Globus distribution. Working with the Globus developers, we completed the Nexus communication module to include the INX protocol for the Paragons. This facilitated more efficient communication within the Paragons. Previously, communication within the Paragon was only possible with the using normal TCP/IP.
The clustering application code was compiled using MPICH-G on the Intel Paragons, SGI Origin 2000, and the IBM SP. The MPICH-G was built to use the INX communication protocol within the Paragons, MPL within the IBM SP, and SGI's native MPI within the Origin 2000. The communication across different architectures were facilitated by TCP/IP.
In our implementation of non-hierarchical clustering, the characteristic values of the 9 input variables are used as coordinates to locate each of the 7.8 million map cells in a 9-dimensional environmental data space. The map cells can be thought of as galaxies of "unit-mass stars" fixed within this 9-dimensional volume. The density of "unit-mass stars" varies throughout the data space. "Stars" which are close to each other in data space have similar values of the nine input variables, and might, as a result, be included in the same map ecoregion or "galaxy." The clustering task is to determine, in an iterative fashion, which "stars" belong together in a "galaxy," the number of which is specified by the user. The coordinates of a series of "galaxy" centroids, or its "centers of gravity," are calculated after each iteration, allowing the "centers of gravity" to "walk" to the most densely populated parts of the data space.
The non-hierarchical algorithm, which is nearly perfectly parallelizable, consists of two parts: initial centroid determination, called seed finding, and iterative clustering until convergence is reached. The algorithm begins with a series of "seed" centroid locations in data space--one for each cluster desired by the user. In the iterative part of the algorithm, each map cell is assigned to the cluster whose centroid is closest, by simple Euclidean distance, to the cell. After all map cells are assigned to a centroid, new centroid positions are calculated for each cluster using the mean values for each coordinate of all map cells in that cluster. The iterative classification procedure is repeated, each time using the newly recalculated mean centroids, until the number of map cells which change cluster assignments within a single iteration is smaller than a convergence threshold. Once the threshold is met, the final cluster assignments are saved.
Seed centroid locations are ordinarily established using a set of rules which sequentially examine the map cells and attempt to preserve a subset of them which are as widely-separated in data space as possible. This inherently serial process is difficult to parallelize; if the data set is divided equally among N nodes, and each node finds the best seeds among it's portion of the cells, and then a single node finds the "best-of-the-best," this set of seeds is not as widely dispersed as a single serially-obtained seed set. On the other hand, the serial seed-finding process is quite slow on a single node, while the iterations are relatively fast in parallel. It is foolish, in terms of the time to final solution, to spend excessive serial time polishing high-quality initial seeds, since the centroids can "walk" relatively quickly to their ultimate locations in parallel. Thus, we opted to implement this "best-of-the-best" parallel seed finding algorithm. It has proven to produce reasonably good seeds very quickly.
The iterative portion of the algorithm is implemented in parallel using the MPI message passing routines by dividing the total number of map cells into parcels or aliquots, such that the number of aliquots is larger than the number of nodes. We employ a classical self-scheduling (master/slave) relationship among nodes and perform dynamic load balancing because of the heterogeneous nature of the metacomputing environment on which the algorithm is run. This dynamic load balancing is achieved by having a single master node act as a "card dealer" by first distributing the centroid coordinates, and then distributing an aliquot of map cells to all nodes. Each slave node assigns each of its map cells to a particular centroid, then reports the results back to the master. If there are additional aliquots of map cells to be processed, the master will send a new aliquot to this slave node for assignment. In this way, faster and less-busy nodes are effectively utilized to perform the majority of the processing. If the load on the nodes changes during a run, the distribution of the work load will automatically be shifted away from busy or slow nodes onto idle or fast nodes. At the end of each iteration, the master node computes the new mean centroid positions from all assignments, and distributes the new centroid locations to all nodes, along with the first new aliquot of map cells. Because all nodes must be coordinated and in-step at the beginning of each new iteration, the algorithm is inherently self-synchronizing.
If the number of aliquots is too low (i.e., the aliquot size is too large), the majority of nodes may have to wait for the slowest minority of nodes to complete the assignment of a single aliquot. On the other hand, it may be advantageous to exclude particularly slow nodes so that the number of aliquots, and therefore the amount of inter-node communication, is also reduced often resulting in shorter run times. Large aliquots work best for a parallel machine with few and/or homogeneous nodes or very slow inter-node communication, while small aliquots result in better performance on machines with many heterogeneous nodes and fast communication. Aliquot size is a manually tunable parameter, which makes the code portable to various architectures, and can be optimized by monitoring the waiting time of the master node in this algorithm.
In order to provide some fault-tolerance, the master node saves centroid coordinates to disk at the end of each iteration. If one or more nodes fails or the algorithm crashes for some reason, the program can simply be restarted using the last-saved centroid coordinates as initial seeds, and processing will resume in the iteration in which the failure occurred.
For the tests performed in this study the following machines are used in various combinations:
For the final paper, we will include a heterogeneous PC cluster named the "Stone Soupercomputer" (Hoffman and Hargrove, 1999) to the above mix.
Preliminary tests were performed to study the performance of the
code under different metacomputing configurations. Some initial tests
were performed to evaluate the Globus overhead on the Intel Paragons. A
16 CPU local run on the XPS/5 indicated that the Globus overhead was
about 10% compared to a version compiled without Globus. However, when
the number of CPUs were increased to 64, the Globus version was about
20 times slower than the non-Globus version. Further analysis indicated
that most of the Globus slow-down was due to frequent polling by the
CPUs to check for incoming messages. When the polling frequency was
reduced by increasing the GLOBUS_NEXUS_SKIP_POLL
environment variable from the default 100 to 10 000, the slowdown was
improved from 20 times to about 1.8 times. The other primary parameter
that was adjusted to improve performance is the number of aliquots
(naliquot) parameter described above, which is input to
the clustering code. This parameter was reduced from 2300 to 512
resulting in a Globus overhead of about 20% for the 64 CPU
configuration.
Our first heterogeneous test used a simple configuration with moderate number of CPUs on 4 different machines: 4 CPUs each on the Intel Paragon XPS/5 at ORNL, the Intel Paragon XPS/35 at ORNL, the IBM SP at ANL, and the SGI Origin 2000 at NCSA. We performed several tests by progressively adding each machine to the configuration. Following are the preliminary timings for 1 iteration of the clustering algorithm:
For the final paper we will present more detailed results with
performance numbers for various large CPU configurations. We will also
examine the effects of various parameters such as the Globus polling
frequency and naliquot on the various configurations.
The clustering algorithm was used to generate maps with 1000, 2000, and 3000 ecoregions. These maps appear to capture the ecological relationships among the nine input variables. This multivariate geographic clustering can be used as a way to spatially extend the results of ecological simulation models by reducing the number of runs needed to obtain output over large areas. Simulation models can be run on each relatively homogeneous cluster rather than on each individual cell. The clustered map can be populated with simulated results cluster by cluster, like a paint-by-number picture. This cluster fill-in simulation technique has been successfully used by the Integrated Modeling Project to assess the health and productivity of southeastern forests.
![]() |
| Figure 2: National map clustered on elevation, edaphic, and climate variables into 3000 ecoregions using similarity colors. |
Finally, we have demonstrated that while special considerations must be made for algorithms running on relatively-large heterogeneous systems and metacomputing environments, like dynamic load balancing and fault tolerance, they are not difficult to implement and will result in enhanced performance not just on a single parallel machine, but can be tuned for many different architectures.
__________
*Oak Ridge National Laboratory, managed by Lockheed
Martin Energy Research Corp. for the U.S. Department of Energy under
contract number DE-AC05-96OR22464.
| "The submitted manuscript has been authored by a contractor of the U.S. Government under contract No. DE-AC05-96OR22464. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes." |