|
The Open Resilient Cluster Manager (ORCM, or OpenRCM) is an
open-source project focused on developing an "always on" resource
manager for high-performance computing systems of any size. The
objectives for the system are:
- Keep running applications operating through single or multiple
failures of any process within an application.
- Proactively detect incipient failures (hardware and/or software)
and respond appropriately to maintain overall system operation.
- Support both MPI and non-MPI applications.
- Provide a research platform for exploring new concepts and methods
in resilient systems.
It is expected that both open and proprietary elements will be
incorporated into the ORCM system. Thus, the overall architecture of
the system is built upon the ORTE (Open Run-Time Environment) and OPAL
(Open Portable Access Layer) layers of the Open MPI project.
Development within the ORCM effort frequently touches both projects,
contributing to improved Open MPI capability as well as advanced ORCM
features.
Several features distinguish OpenRCM from other common resource
managers, including (but not limited to):
- Full utilization of component architecture methods to provide a
platform for research and production code to coexist and be tested in
actual production environments.
- A focus on fault prediction, integration with embedded
state-of-health sensors, and proactive response to both hardware and
software faults.
- Support for dynamically adding resources to, and removing them from,
running multi-node applications, allowing for "on-the-fly" removal and
replacement of nodes without stopping applications.
- A built-in communications library for resilient applications that
automatically maintains communications in the presence of failed
processes.
- An architecture designed to support platforms ranging from small
embedded multi-processor systems to large-scale high-performance
computing clusters.
Current status
OpenRCM is currently under development, with an initial release
expected in early 2010. Interested parties are welcome to get a
developer's checkout from our Subversion repository (sorry, no
tarballs are available yet). While we do our best to ensure that the
development trunk always builds and runs, we cannot guarantee the
stability of that code base. Please feel free to report problems, and
to suggest improvements, on the appropriate mailing list.
Instructions on how to build OpenRCM, and the required Open MPI
support, are provided in the HACKERS file at the top of the OpenRCM
code base.
Questions and bugs
Questions, comments, and bug reports should be sent to the ORCM
mailing lists. Also be sure to see the ORCM wiki and bug tracking
system for information about ORCM's design and the project.
History / credits
OpenRCM was originally conceived and developed within Cisco
Systems, Inc. as an advanced resource manager for high-performance
router control systems. Given the potential cross-over application to
the HPC community, and the involvement of several major universities
in the project, the decision was made to release it as open source
in the hope that others may benefit from it and contribute to its
evolution.
|