Cornell
University

Georgia Institute
of Technology

Lawrence Livermore
National Laboratory


Flexible, Reconfigurable, System-wide Monitoring

Forming a building block for autonomous systems

Overview:

Our ability to solve Grand Challenge Problems in computing hinges on the development of reliable and efficient High Performance Computing (HPC) systems. These systems must be autonomous, self-aware, self-adapting, and self-healing. Building such systems requires the existence of flexible, introspective data acquisition mechanisms to determine the state of the system, to detect malfunctions or inefficiencies, and to provide the basis for appropriate adaptation (to steer system configuration and optimization).

In Owl, we are developing a reconfigurable monitoring framework that will function as one of the fundamental building blocks for such autonomous systems. Owl splits monitoring functionality into two parts: capsules, which contain reconfigurable logic, and analysis modules, which are loaded into the capsules and perform data aggregation and preprocessing. The capsules contain the actual data probes and may be located throughout the system.
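The capsule/module split can be illustrated with a small software model. This is only a sketch under stated assumptions: Owl's capsules are hardware, and the names `Capsule`, `WindowSumModule`, `load`, `probe`, and `process`, as well as the two capsule locations, are hypothetical and chosen for illustration. The point is that one module type can be loaded into capsules at different locations because every capsule talks to its module through the same interface.

```python
from typing import List, Optional


class WindowSumModule:
    """Hypothetical analysis module: sums probed values over a fixed window."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.total = 0
        self.seen = 0

    def process(self, sample: int) -> Optional[int]:
        """Consume one probed sample; return an aggregate once per window."""
        self.total += sample
        self.seen += 1
        if self.seen == self.window:
            result = self.total
            self.total = 0
            self.seen = 0
            return result
        return None  # nothing to report yet


class Capsule:
    """Hypothetical capsule: owns a probe and forwards its data to a module."""

    def __init__(self, location: str) -> None:
        self.location = location
        self.module: Optional[WindowSumModule] = None
        self.output: List[int] = []

    def load(self, module: WindowSumModule) -> None:
        """Load an analysis module through the common interface."""
        self.module = module

    def probe(self, sample: int) -> None:
        """Direct one probed sample to the loaded module, if any."""
        if self.module is None:
            return
        result = self.module.process(sample)
        if result is not None:
            self.output.append(result)


# The same module type is reused at two (assumed) system locations.
memory_capsule = Capsule("memory controller")
network_capsule = Capsule("network interface")
memory_capsule.load(WindowSumModule(window=2))
network_capsule.load(WindowSumModule(window=3))

for sample in [1, 2, 3, 4, 5, 6]:
    memory_capsule.probe(sample)
    network_capsule.probe(sample)

print(memory_capsule.location, memory_capsule.output)
print(network_capsule.location, network_capsule.output)
```

In this model the capsule knows nothing about what the module computes, which is what makes the aggregation logic reusable across locations.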

Each capsule provides a standardized interface between itself and the reconfigurable logic containing the analysis module. This allows analysis modules to be applied at any system location and thereby enables the reuse of analysis and aggregation techniques. A module's logic may be instantiated from a library of existing modules. Each loaded module may further be configured through memory-mapped configuration registers available in each capsule. Once activated, the capsule directs the probed data to the module, where it is preprocessed, analyzed, aggregated, or simply compressed. When necessary, the module generates output data, which is injected into the regular system memory traffic and stored in a reserved region of main memory organized as a ring buffer of configurable size.
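The data path described above (configuration registers, module processing, ring buffer output) can be sketched in software as follows. This is a minimal model, not Owl's implementation: the register offsets `REG_ADDR_LOW` and `REG_ADDR_HIGH`, the address-range filter, and the class names are hypothetical. The behavior on wrap-around (overwriting the oldest record) is also an assumption, since the text only states that the buffer is a ring of configurable size.

```python
from typing import List, Optional

# Hypothetical register offsets; the real register layout is not specified.
REG_ADDR_LOW = 0x0
REG_ADDR_HIGH = 0x4


class AddressFilterModule:
    """Hypothetical module: passes only addresses in [low, high)."""

    def __init__(self) -> None:
        self.regs = {REG_ADDR_LOW: 0, REG_ADDR_HIGH: 0}

    def write_register(self, offset: int, value: int) -> None:
        """Model a write to a memory-mapped configuration register."""
        if offset not in self.regs:
            raise KeyError(f"unknown register offset {offset:#x}")
        self.regs[offset] = value

    def process(self, address: int) -> Optional[int]:
        """Return the address if it lies in the configured range, else None."""
        if self.regs[REG_ADDR_LOW] <= address < self.regs[REG_ADDR_HIGH]:
            return address
        return None


class RingBuffer:
    """Fixed-size ring buffer standing in for the reserved memory region."""

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("size must be positive")
        self.slots: List[Optional[int]] = [None] * size
        self.head = 0   # index of the next slot to write
        self.count = 0  # number of valid records

    def inject(self, record: int) -> None:
        """Store one record; when full, overwrite the oldest (assumption)."""
        self.slots[self.head] = record
        self.head = (self.head + 1) % len(self.slots)
        self.count = min(self.count + 1, len(self.slots))

    def snapshot(self) -> List[Optional[int]]:
        """Return the valid records ordered from oldest to newest."""
        size = len(self.slots)
        start = (self.head - self.count) % size
        return [self.slots[(start + i) % size] for i in range(self.count)]


# Configure the module, then feed it a stream of probed addresses.
module = AddressFilterModule()
module.write_register(REG_ADDR_LOW, 0x1000)
module.write_register(REG_ADDR_HIGH, 0x2000)
buffer = RingBuffer(size=3)

for address in [0x0800, 0x1000, 0x1004, 0x1008, 0x100C, 0x2000]:
    record = module.process(address)
    if record is not None:
        buffer.inject(record)

print([hex(r) for r in buffer.snapshot()])
```

Four of the six addresses fall inside the configured range, so with a buffer of three slots the oldest accepted record is overwritten and the three most recent remain.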

Figure 1: Monitoring capsules can potentially be located anywhere in the system.
Figure 2: A standardized interface allows the exchange of monitoring modules.

First results show that a monitoring system with autonomous data delivery has a relatively small impact on system performance, even when logging individual memory accesses, and that the overhead becomes negligible at lower injection rates. In addition, simple hardware techniques can further reduce system perturbation in the general case. Our feasibility studies demonstrate the viability of the general approach. Because the framework was designed as a general monitoring facility, we believe its success in the specific context of memory analysis will extend to more pervasive system-wide monitoring and lead towards a better understanding of system behavior.

Contacts:

Funding:

This work is supported by the National Science Foundation under:
NSF Medium ITR/NGS Award (#0325536), Towards Autonomic Computing Platforms: System-wide Hardware/Software Performance Monitoring and Adaptation



December 16th, 2004.