Walkers Application Performance on the Penguin Cluster

Forrest M. Hoffman and William W. Hargrove
November 21, 2003

A performance test was conducted to determine the optimal single-node configuration for the Walkers application on the Penguin Cluster. The Penguin Cluster consists of 22 Intel nodes (9 dual 1.0GHz Pentium IIIs, 3 dual 1.4GHz Pentium IIIs, and 10 dual 3.06GHz Xeons) interconnected with Dolphinics SCI cards using Scali MPI software. The application simultaneously uses MPI (in a Master/Slave arrangement) for distributed-memory work allocation and POSIX threads (pthreads) for shared-memory work distribution within a node. The user may choose the number of threads for slave nodes with a command-line flag. The purposes of this test were to 1) determine the performance impact of using a shared or private copy of the "footprints" map, 2) quantify the performance improvement from the new Xeon processors, and 3) quantify the performance impact of Xeon HyperThreading on this application.

Walkers Application Performance on Penguin Cluster
using Dual 1.4GHz Pentium III Master running Red Hat 7.3

                                  2 threads,    4 threads,    2 threads,    4 threads,
                                  private map   private map   shared map    shared map
Dual 1.4GHz Pentium III Slave      36:30.298     41:57.784     36:34.177     39:46.383
Dual 3.06GHz Xeon Slave,
  HyperThreading disabled          12:14.554     20:31.110     11:27.812     18:17.366
Dual 3.06GHz Xeon Slave,
  HyperThreading enabled           12:55.238     22:01.623     11:52.192     19:27.315

(Run times in minutes:seconds; lower is better.)

These results show that two threads are better than four in all cases on these nodes, and that sharing the "footprints" map does not have a detrimental effect on performance. In fact, sharing the "footprints" map actually seems to improve performance slightly, especially with four threads. With private copies of the "footprints" map, the application's memory footprint grows with the number of threads; on machines with many SMP processors, however, contention for updates to a shared "footprints" map could become a problem.

While the performance difference between the Pentium III and the Xeon was around a factor of three (as expected) when using two threads, it was very close to a factor of (only) two when using four threads. The reason for this difference is unclear. Nevertheless, the factor of 3.2 between the Pentium III and the Xeon for the two-thread, shared-map case (36:34.177 vs. 11:27.812, i.e., 2194.2 s / 687.8 s ≈ 3.19) is excellent.

Of particular interest is the fact that HyperThreading, when enabled in the kernel of the dual Xeon node, actually has a small negative effect on performance, producing slightly longer run times. Moreover, this effect is more pronounced when using four threads. This result is somewhat surprising, since HyperThreading was designed to improve the performance of threaded applications such as the Walkers code.


Forrest Hoffman (forrest@climatemodeling.org)