AppLogic 2.7/2.8 Documentation The latest production release is AppLogic 3.0.30
AppLogic High-Availability
Overview
AppLogic is a self-healing grid operating system which improves the availability of applications and services. High-availability in AppLogic refers to a set of features which together provide:
- Automated recovery from application and grid failures. Unlike in traditional systems, no manual user intervention is required.
- Detection of potential problems in order to prevent subsequent application/grid failures.
One of the most important features of AppLogic's high-availability is the automated application/grid recovery from various types of failures. For applications that are running on AppLogic, automated recovery is provided for AppLogic appliances that may fail due to either appliance software failure or server failure (either physical server failure or a software bug that leads to a server failure). In addition, an AppLogic grid is able to tolerate failures of its grid controller. If the AppLogic grid controller fails, it is either restarted on the same server (if possible) or a different server while minimizing the impact on running applications.
AppLogic also proactively detects various types of issues and alerts the user before those issues can cause application or grid failures/downtime. AppLogic automatically detects the following types of problems:
- application failures following a server failure due to insufficient grid resources (i.e., not enough resources to restart one or more running appliances)
- application failures following a server failure due to the use of degraded volumes
- server hard disk errors or failures
- unavailability of the AppLogic grid controller following a failure of the physical grid controller server
- detection of improper grid configuration related to AppLogic's high-availability features or user alerts
The following sections describe AppLogic's high availability features in more detail. Information is also included regarding the effects of various failures on the applications/grid and what a user may experience during such failures.
Automated Recovery of Applications and Services
Server Failure
AppLogic can automatically recover from the loss of one or more physical servers. A physical server that is part of an AppLogic grid may fail for any of the following reasons:
- spontaneous restart of the server: kernel crash, reboot/reset/power-cycle
- complete failure of the server: burned out motherboard, bad power-supply, bad NIC
- partial failure of the server: intermittent hard disk failure
- partial network connection failure of the server: bad cable, in the case of non-HA network
To help tolerate server failures, AppLogic mirrors all volumes across the servers of a grid (by default all volumes are mirrored by 2). Volume mirroring allows appliances to sustain operation through a physical server failure, unless the appliance itself is running on the failed server.
AppLogic detects a failed server by the loss of the server's network connection to the grid controller (typically within 3 minutes of the server failure). When the failure is detected, any appliances that were running on that server are automatically scheduled to run on other servers in the grid. Note that appliances can only be restarted as long as there are enough available resources in the grid. AppLogic displays an alert on the grid dashboard if there are not enough available resources to restart appliances upon server failure. If this alert is present on the grid dashboard, please contact your service provider so additional servers can be added to the grid.
- There are not enough available resources to restart components running on n server(s) [list_of_servers].
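The check behind this alert can be pictured with a small sketch. It assumes each server and appliance is described only by CPU and memory figures and uses a simple first-fit placement; the class names, fields and heuristic are hypothetical and are not AppLogic's actual scheduler.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    free_cpu: float
    free_mem_mb: int


@dataclass
class Appliance:
    name: str
    cpu: float
    mem_mb: int


def can_restart_elsewhere(failed, appliances, servers):
    """Return True if every appliance from the failed server fits on the others.

    Uses first-fit over the largest appliances first; this is an
    illustrative heuristic, not AppLogic's real placement logic.
    """
    # Track remaining capacity per surviving server without mutating the inputs.
    free = {s.name: [s.free_cpu, s.free_mem_mb] for s in servers if s.name != failed}
    for app in sorted(appliances, key=lambda a: (a.mem_mb, a.cpu), reverse=True):
        for cap in free.values():
            if cap[0] >= app.cpu and cap[1] >= app.mem_mb:
                cap[0] -= app.cpu
                cap[1] -= app.mem_mb
                break
        else:
            return False  # no surviving server can host this appliance
    return True
```

When such a check returns False for a server, the grid is in the situation that the alert above warns about.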
Upon server failure and the automatic restart of appliances, AppLogic posts recovery alerts on the grid dashboard. As an example, a user will see the following alerts upon the failure of srv3 in their grid (assuming that there were appliances running on srv3 and there are enough available grid resources to restart the appliances):
- Lost connection to server 'srv3' on date.
- Appliance 'comp_name' failed due to server lost at date. Lost connection to server srv3.
- Restarting appliance 'comp_name' on date due to server failure.
- (the previous two messages are repeated for each appliance that needs to be restarted)
When an appliance is successfully restarted after the server failure, the previous two alerts are destroyed and the following alert is posted on the grid dashboard:
- Restarted appliance 'comp_name' on date due to server failure.
- (the previous message is repeated for each appliance that was restarted successfully)
If AppLogic is unable to restart one or more appliances, one of the following alerts is posted on the grid dashboard for each failed appliance. Please use the list log command to view the controller log for details on exactly why the appliance failed to be started.
- Failed to restart appliance 'comp_name' on date after server failure. Failed to allocate resources for appliance.
- Failed to restart appliance 'comp_name' on date after server failure. Appliance restart failed.
Server flapping
If a physical server fails 3 times within a 24 hour period (known as server flapping), AppLogic automatically disables that server (using the srv disable command). This prevents resources from being scheduled on the server, since it is likely to fail again. The server may be re-enabled using the srv enable command. When the server is re-enabled, it must fail 3 times within a 24 hour period before it is automatically disabled again.
The following alert is posted to the grid dashboard when AppLogic detects that a server is flapping:
- Server server was disabled on date because it has gone down too often within the specified time period.
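The flapping rule (3 failures within 24 hours) can be illustrated with a sliding-window counter. This is a minimal sketch only: the class name is hypothetical, and whether the 24 hour boundary is inclusive is an assumption, since the text does not say.

```python
from collections import deque
from datetime import datetime, timedelta


class FlapDetector:
    """Flags an entity that fails `limit` times within `window`."""

    def __init__(self, limit=3, window=timedelta(hours=24)):
        self.limit = limit
        self.window = window
        self.failures = deque()

    def record_failure(self, when):
        """Record a failure; return True if the entity is now flapping."""
        self.failures.append(when)
        # Drop failures that have aged out of the window (boundary handling is assumed).
        while self.failures and when - self.failures[0] >= self.window:
            self.failures.popleft()
        return len(self.failures) >= self.limit

    def reset(self):
        """Forget past failures, e.g. after the server is re-enabled."""
        self.failures.clear()
```

Calling reset() models the statement that a re-enabled server must again fail 3 times within 24 hours before it is disabled.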
Degraded volumes resulting from a server failure and automated volume repair
After a server failure, AppLogic volumes that had a mirror on the failed server become degraded (assuming those volumes have mirrors on other available servers). As of AppLogic 2.7+, AppLogic automatically repairs degraded volumes. In previous AppLogic versions, volume repair was a manual operation that was executed by the user or grid maintainer. AppLogic volumes are automatically repaired in the following priority order to ensure that the most important volumes are repaired first which reduces the risk of application/grid downtime:
- system volumes (boot, meta, impex)
- application user volumes
- application local catalog volumes
- global catalog and _GLOBAL volumes
- volcache volumes
- all other volumes
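The priority order above amounts to a stable sort over volume categories. The sketch below assumes each degraded volume has already been tagged with a category; the category labels and the function name are hypothetical and only mirror the list.

```python
# Repair priority taken from the list above; a lower number is repaired first.
# The category labels are illustrative names, not AppLogic identifiers.
REPAIR_PRIORITY = {
    "system": 0,
    "app_user": 1,
    "app_local_catalog": 2,
    "global_catalog": 3,
    "volcache": 4,
    "other": 5,
}


def repair_order(degraded):
    """Sort (volume_name, category) pairs into repair order.

    Unknown categories are treated as "other". sorted() is stable, so
    volumes in the same category keep their original relative order.
    """
    return sorted(
        degraded,
        key=lambda v: REPAIR_PRIORITY.get(v[1], REPAIR_PRIORITY["other"]),
    )
```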
After a server failure, AppLogic waits 4 hours before it attempts to repair the volumes that were degraded as a result of the failure. This gives the server a chance to be recovered so the degraded volumes can be recovered on the same server to which their streams were originally assigned. If the failed server is not recovered within 4 hours, AppLogic repairs the degraded volumes using the other servers in the grid. If a volume needs to be repaired immediately after a server failure, the user can initiate the repair using the vol repair vol_name --force command.
AppLogic's automated volume repair runs once every 6 hours to collect the list of degraded volumes and initiate repairs over those volumes. AppLogic users do not have to do anything to repair degraded volumes; volume repair is now automatic. The user can instruct AppLogic to retrieve the current list of degraded volumes by executing vol check; this can be used to ensure that the current list of degraded volumes is scheduled for repair.
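The text does not say how the 4 hour wait and the 6 hour repair cycle interact. As a rough illustration only, the sketch below assumes a degraded volume is picked up by the first repair cycle that runs once the 4 hour wait has elapsed; the function name and that interaction are assumptions.

```python
from datetime import datetime, timedelta

REPAIR_DELAY = timedelta(hours=4)     # wait after a server failure (from the text)
REPAIR_INTERVAL = timedelta(hours=6)  # automated repair cycle (from the text)


def first_auto_repair(failure_time, reference_run):
    """Estimate when a volume degraded at failure_time is first picked up.

    reference_run is the time of any repair cycle at or before the failure.
    Assumes cycles continue every REPAIR_INTERVAL from that point.
    """
    eligible = failure_time + REPAIR_DELAY
    run = reference_run
    while run < eligible:
        run += REPAIR_INTERVAL
    return run
```

Under this assumption, a volume that just misses a cycle waits for the next one, which is a case where vol repair vol_name --force may be preferable.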
If there are particular volumes that should not be repaired, or whose repair should be delayed, you may suspend the repair of a specific volume or of all volumes by using the vol repair --suspend command listed below. Note that you can only suspend volume repairs for a maximum of 1 week. The maximum volume suspend time can be configured by the grid maintainer. Please see the automated volume repair configuration topic for more information (this topic is accessible only to grid maintainers).
In addition to automatic volume repair, AppLogic allows the user to execute the following volume maintenance operations:
- Initiate repairs over specific volumes right away:
vol repair vol_name --force
- Suspend the repair of a specific volume for the specified amount of time:
vol repair vol_name --suspend time=time
- Resume the repair of a suspended volume right away:
vol repair vol_name --resume
- Retrieve the current repair status for a specific volume:
vol repair vol_name --status
- Retrieve the current state of all degraded volumes and schedule repairs:
vol check
The suspend, resume and status operations may also be executed over all volumes by omitting the volume name. Please see the vol repair CLI reference for more information.
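For scripting, the command forms listed above can be assembled by a small helper. This is a sketch that only builds the command strings shown in this section; the helper itself is hypothetical, the format of the time value is not specified here, and requiring a volume name for --force is an assumption based on the text naming only suspend, resume and status as usable without one.

```python
def vol_repair_command(action, vol_name=None, time=None):
    """Build a `vol repair` command line from the forms listed above."""
    if action not in ("force", "suspend", "resume", "status"):
        raise ValueError(f"unsupported action: {action}")
    # Assumption: only suspend, resume and status may omit the volume name.
    if action == "force" and vol_name is None:
        raise ValueError("--force requires a volume name")
    parts = ["vol", "repair"]
    if vol_name is not None:
        parts.append(vol_name)
    parts.append(f"--{action}")
    if action == "suspend":
        if time is None:
            raise ValueError("--suspend requires a time value")
        # The value is passed through unchanged; its format is not defined here.
        parts.append(f"time={time}")
    return " ".join(parts)
```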
Server rebooting and power control
This section pertains only to AppLogic 2.7+.
Beginning with AppLogic 2.7, servers can reboot under the control of AppLogic in any of the following three cases:
- when the server is rebooted by the user either from a grid reboot (grid reboot) or a server reboot (srv reboot)
- when the server loses connection to the grid controller and does not reconnect to the grid controller within 1 minute
- in this case AppLogic uses the server's management control to power-cycle the server (if power control is available)
- when the server is unable to communicate with the other servers in the grid; usually caused by a NIC/hardware failure on the server
- a server in this state is known as an isolated server as it cannot communicate with any other server in the grid
If a grid is configured to use server management (i.e., power control), AppLogic automatically power-cycles servers that lose connection to the grid controller as stated above. This is used for the faster recovery of appliances in case of server failure. In addition, the user can take advantage of the server management to execute any of the following server operations:
- srv power_off: power down the specified server; this can be used to save power for servers that are not being used or to power-down faulty servers
- srv power_on: power up the specified server
- srv power_cycle: power-cycle the specified server; useful for rebooting a non-responsive server to get it back online
Please see the server CLI reference for more information about the server power control commands.
To check if your grid has server management enabled, execute srv info --extended. The server management support is specified by the Management Enabled field in the output.
AppLogic currently supports only IPMI-based server management.
Appliance Failure
AppLogic automatically restarts appliances that have crashed or shut down unexpectedly. In such cases, AppLogic restarts the appliance on the same server where it was running, with exactly the same resources and settings that the appliance had before it failed. Currently, AppLogic detects a failed appliance when the appliance's virtual machine disappears from the server; this occurs if the appliance crashes or is shut down/rebooted. Typically, the appliance failure is detected and the appliance is restarted within 1 minute. Note that the appliance restart time also depends on how long it takes for the appliance itself to boot.
When an appliance failure is detected, the following alert is posted on the grid dashboard:
- Restarting appliance 'comp_name' on date due to appliance failure.
- (the previous message is repeated for each failed appliance that needs to be restarted)
When an appliance is successfully restarted after failure, the previous alert is destroyed and the following alert is posted on the grid dashboard:
- Restarted appliance 'comp_name' on date due to appliance failure.
- (the previous message is repeated for each appliance that was restarted successfully)
If AppLogic is unable to restart one or more appliances, the following alert is posted on the grid dashboard for each failed appliance. Please use the list log command to view the controller log for details on exactly why the appliance failed to be started.
- Failed to restart appliance 'comp_name' on date after appliance failure.
If an appliance needs to be rebooted/shutdown without AppLogic detecting that as an appliance failure, the field engineering code 64 can be used for this purpose. Please see the field engineering code reference for more information.
AppLogic cannot detect internal appliance failures such as software crashes, software malfunctions/bugs, low memory conditions, etc. 3tera is planning on adding this type of appliance failure detection in a future AppLogic release.
Appliance flapping
If an appliance fails 3 times within a 24 hour period (known as appliance flapping), AppLogic does not automatically restart that appliance; the appliance is left in the STANDBY state. The appliance may be manually started by a user using comp start. When the appliance is restarted (either by comp start or app restart), it must again fail 3 times within a 24 hour period before AppLogic decides not to restart the failed appliance automatically.
The following alert is posted to the grid dashboard when AppLogic detects that an appliance is flapping:
- Appliance 'comp_name' restart failed at date. Appliance has failed too often within the specified time period.
Appliance flapping does not apply to appliances that use field engineering code 64 (appliance reboots/shutdowns are ignored by AppLogic).
Automated Recovery of the AppLogic Grid Controller
This section pertains only to AppLogic 2.7+. Previous AppLogic versions cannot tolerate failures of the grid controller and such failures would result in a full grid reboot and application downtime.
Beginning with AppLogic 2.7, AppLogic can tolerate failures of the grid controller with minimal to no application downtime (the grid controller is no longer a single point of failure for the grid). This section describes how AppLogic tolerates grid controller failures and the effects on the applications and the grid.
Grid Controller Failure
AppLogic automatically recovers from various types of failures of the grid controller virtual machine that runs on the grid's primary server (the primary server is the server that runs the grid controller virtual machine; each AppLogic grid has one and only one primary server). Recovery from a failed grid controller has no effect on the applications that are running on the grid. AppLogic monitors the grid controller and automatically detects and handles any of the following software failure conditions that may lead to grid controller downtime:
- crash of the grid controller
- unexpected shutdown/reboot of the grid controller
- unresponsive grid controller
- corruption of the grid controller boot or meta volumes
- out of memory errors within the grid controller
When any of the above failures occur, the grid controller is automatically restarted on the primary server without affecting any of the running applications. From a visibility standpoint, the grid controller will be unavailable for under 5 minutes while the controller recovery is in progress. Once the grid controller has recovered from the failure, it automatically reacquires the state of the grid and continues operation as if the failure never occurred. An alert is posted to the grid dashboard that conveys the reason why the grid controller had failed. Please see the grid controller dashboard messages for a full list of the alerts that can be posted.
Like the physical server failures and the appliance failures, if the grid controller fails 3 times within a 24 hour period, AppLogic does not automatically restart the grid controller. If this situation occurs, please contact your service provider immediately.
The automatic restart of the grid controller upon failure is supported on AppLogic grids of all sizes (single server, 2 servers, etc).
No matter what happens to the grid controller, the running applications should never be affected and should continue to operate.
Grid Controller Server Failure
In addition to automatically handling both grid controller and physical server failures, AppLogic can also handle failures of the grid controller server (i.e., the physical server where the grid controller is currently running; also known as the primary server). A controller server may fail for any number of reasons, the same as any other server within the grid.
In order to tolerate failures of the controller server, AppLogic uses server roles to define a set of servers in a grid that are able to run the grid controller in case of failures (i.e., backup controller servers). AppLogic uses the following server roles within a grid (these are automatically configured by AppLogic but may also be specified by a user or grid maintainer):
- primary: Server that is currently running the AppLogic grid controller
- secondary: Server that may run the AppLogic grid controller in case of a primary controller server failure
- none: Server will never run the AppLogic grid controller and does not participate in controller high-availability
By default, every AppLogic grid is configured with the following server roles:
- 1 server grids: one primary server; single server grids cannot tolerate failures of the primary server (no grid controller server high-availability)
- 2 server grids: one primary and one secondary server (and one reference server, see below)
- 3+ server grids: one primary and two secondary servers; the remaining servers in the grid have their role set to none and do not participate in the controller server failure recovery
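The default role layout by grid size can be written out as a short function. This sketch assigns roles by list position, which is an assumption: the text states how many servers get each role, not which ones. The function name is hypothetical.

```python
def default_roles(server_names):
    """Assign default controller roles based on grid size.

    Which servers get which role is an assumption here (list order);
    the document only specifies how many of each role exist.
    """
    roles = {}
    for i, name in enumerate(server_names):
        if i == 0:
            roles[name] = "primary"
        elif i <= 2:
            # At most two secondary servers; a 2 server grid has only one.
            roles[name] = "secondary"
        else:
            roles[name] = "none"
    return roles
```

A 2 server grid additionally relies on a reference server, which is outside the grid and therefore not modeled here.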
If the primary server fails (hardware/software failure, powered-down, etc), one of the secondary servers automatically takes over as the new primary server for the grid. If the old primary server is restored to operation, it automatically becomes a non-primary server (secondary server). The new primary server starts the grid controller and once the controller is restored, the controller automatically reacquires the state of the grid. Just like for physical server failures, AppLogic also automatically restarts appliances that were running on the failed primary server. The use of secondary controller servers and the auto-restart of appliances allows AppLogic to tolerate failures of the primary controller server.
A user can view the server roles that are assigned in their grid using the srv list command. The server roles may also be modified using the srv set command. Grid maintainers may also use aldo to modify the server roles within their grids.
Note that for grids with exactly 2 servers, AppLogic requires a 3rd reference server in order to properly support the grid controller HA. By default, when a 2.7+ grid is installed or upgraded, the AppLogic installer assigns the ALD server as the reference server for the grid. This setting may be overridden by the maintainer; see the AppLogic installer documentation about the reference server for more details. The same ALD server may be used as a reference server for all grids on the same backbone.
Here are some important notes to keep in mind about AppLogic's grid controller high-availability:
- In order for a grid to recover from a controller server failure, there must be at least 2 secondary servers up and running at the time of the server failure. If this requirement is not met (e.g., there is only one secondary server at the time of the primary server failure), the grid controller remains down and requires grid maintainer intervention in order to restore the grid controller to an operational state. If this type of controller failure is encountered, please contact your service provider for assistance.
- If the primary server fails and does not come back online for at least 2 hours, AppLogic automatically assigns a new secondary server within the grid in order to maintain at least 2 secondary servers for grid controller failover.
- In general, when AppLogic schedules appliances to start on either comp start or app start, the appliances are scheduled on servers based on both their role and available resources. AppLogic first tries to schedule appliances to run on servers with a role of none, then the primary server and lastly secondary servers. The secondary servers are used as a last resort for scheduling so there is a greater chance that there are available resources to start the controller if needed.
- When a secondary server takes over as the new primary server, if there are not enough resources available on the server to start the grid controller, AppLogic restarts appliances which are running on the new primary server on other servers within the grid so the grid controller can be started on the new primary server. Note that this may break appliance failover groups. If AppLogic stops one of these appliances it may not be able to restart the appliance on another server since there may not be enough resources to satisfy the failover group.
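The role-based scheduling preference in the notes above (none, then primary, then secondary) can be sketched as a sort key. Memory is used as the only resource and the tie-break within a role is an assumption; the function name and tuple layout are hypothetical.

```python
# Documented preference: role "none" first, then primary, then secondary.
ROLE_PREFERENCE = {"none": 0, "primary": 1, "secondary": 2}


def pick_server(servers, needed_mem_mb):
    """Pick a server for an appliance, preferring roles in the documented order.

    servers: list of (name, role, free_mem_mb) tuples.
    Returns the chosen server name, or None if nothing fits.
    """
    candidates = [s for s in servers if s[2] >= needed_mem_mb]
    if not candidates:
        return None
    # Within a role, prefer the server with the most free memory (assumption).
    best = min(candidates, key=lambda s: (ROLE_PREFERENCE[s[1]], -s[2]))
    return best[0]
```

Secondary servers are chosen only when nothing else fits, which keeps capacity free for the grid controller.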
Visibility during controller server recovery
When the primary server fails, a user may point their browser to their grid controller host name/IP and observe the controller recovery progress. Once the controller has been recovered, the user is automatically redirected to the AppLogic GUI for their grid.
Authentication
The user must be authenticated in order to access the controller recovery GUI to observe the recovery progress/status. To log into the recovery GUI, click on the Login button, enter the recovery GUI password within the dialog box and click the OK button.
The recovery GUI password may be modified via the grid set command. Please note that a controller reboot is required for the new password to take effect.
Below is a snapshot of the controller recovery GUI authentication page:
Dashboard
Once the user has been authenticated, they will have access to the dashboard of the controller recovery GUI. Below is a snapshot of the controller recovery GUI dashboard:
The controller recovery GUI displays the following information:
- Dashboard
- Grid Name: name of the grid
- AppLogic Version: version of the grid
- Status: displays that the recovery is in progress and which server is the new controller server
- Role: current role of the new controller server; for controller recovery the role is always "Secondary"
- Known Servers: list of all servers in the grid that contain the good streams for the controller system volumes
- Recovery in progress
- Reason for failure: the reason why the controller server failed (may be unknown)
- Process started: day and time when the controller recovery process started (when the recovery GUI has been started by AppLogic)
- Current time: current time
- Estimated completion: estimated completion time when the grid controller will be recovered
- Remaining time: estimated time that is left to recover the grid controller
- typically it takes about 3 minutes to recover the grid controller
- Details: this is the detail of the various stages for the actual grid controller recovery
- Messages
- this is used to log informational messages and warnings/errors that are encountered during the grid controller recovery
- please see the grid controller recovery messages for the list of messages that can be logged during the recovery process
Once the grid controller is restored, an alert is posted to the grid dashboard that describes the reason why the controller had failed.

The controller recovery GUI is also displayed during the boot of a grid so the boot process can be observed by the user (i.e., grid reboot). In this case, Status would show Starting the controller on primary server srvX.

If the controller recovery fails and the grid controller is not started, please contact your service provider immediately.
Application repair upon Grid Controller Recovery
When the grid controller fails, it is possible that at the time of the failure users were starting/stopping/restarting applications and components. Upon restoration of the grid controller, AppLogic ensures that all applications and components are restored to their expected state; based on the previous commands that were executing before the grid controller had failed. This process of restoring the application/component state is known as repair. Both applications and components have an associated target state that is used in the repair process to make sure that they are properly restored.
As an example, if an application was in the middle of an application restart (app restart) and right before the grid controller failure the application was stopping, AppLogic automatically ensures that the application is properly restarted. In this case, the application's target state is RESTART_STOPPING to indicate that the application was stopping as part of an app restart. The target state for an application can be obtained by executing app info (the target state is only displayed for non-stopped applications).
Applications that are under repair after a grid controller restart may be in one of the following states:
- REPAIRING: application is currently being repaired by AppLogic (components are being stopped/started as needed to restore the application to the appropriate target state)
- RESTART_STOPPING: application is currently being stopped as part of an app restart as in the example above
While the application repair is in progress, the following alert is posted on the grid dashboard:
- Grid recovery in progress: There were N active application(s) when the controller went down. M application(s) have been recovered. The state of P application(s) has been reacquired. Recovering Q application(s).
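A monitoring script that reads dashboard alerts could extract the counters from this message. The sketch below assumes the alert text appears exactly as shown above, with integers in place of N, M, P and Q; how the alert text is obtained is outside its scope, and the function and key names are hypothetical.

```python
import re

# Pattern built from the alert wording above; N, M, P, Q are assumed to be integers.
PROGRESS_RE = re.compile(
    r"Grid recovery in progress: There were (\d+) active application\(s\) "
    r"when the controller went down\. (\d+) application\(s\) have been recovered\. "
    r"The state of (\d+) application\(s\) has been reacquired\. "
    r"Recovering (\d+) application\(s\)\."
)


def parse_progress(alert):
    """Return the four counters from a recovery progress alert, or None."""
    m = PROGRESS_RE.fullmatch(alert.strip())
    if m is None:
        return None
    active, recovered, reacquired, recovering = map(int, m.groups())
    return {
        "active": active,
        "recovered": recovered,
        "reacquired": reacquired,
        "recovering": recovering,
    }
```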
After the application repair is complete, the previous alert is destroyed and the following alert is posted on the grid dashboard (assuming everything was recovered successfully):
- Grid recovery completed on time: There were N active application(s) when the grid controller went down. N application(s) have been recovered. The state of P application(s) has been reacquired.
If there was a failure recovering the applications, the following alert is posted on the grid dashboard:
- Grid recovery completed on time: There were N active application(s) when the grid controller went down. M application(s) failed to be recovered.
If an application fails to be recovered, please use the list log command to view the controller log for details regarding the failure. Usually applications fail to be recovered for one or more of the following reasons:
- not enough resources in the grid (cpu, memory, bandwidth)
- if one or more servers are down, it is possible that some of the application/appliance volumes are in an ERROR state

During the automated application repair process, AppLogic does not allow the user/grid-maintainer to execute destructive CLI commands. This includes any command that affects the state of the grid or any server, application, component, class, catalog or volume. The following error message is displayed if a destructive command is executed during application repair:
- Cannot execute command at this time - the grid controller is currently busy recovering from a failure.

Applications are repaired by AppLogic using the app repair command. This command is valid only for applications that are in a FAILED state. Users may execute this command directly to repair applications that may have failed (i.e., to restore an application where the user has completed the debugging of failed components).
Grid failures that require manual intervention

In some cases, a grid failure can occur where the grid controller is not automatically restarted by AppLogic. Such cases, as observed by the user, are summarized below. If any of the following situations are encountered, please contact your service provider immediately.
- the grid controller is not restarted within a few minutes and there is no access to either the recovery GUI or the AppLogic GUI
- the recovery GUI is accessible and fails to restart the grid controller; in this case the reason why this has happened is specified in the recovery GUI
How long it takes to recover from various types of appliance/server/grid failures
The table below summarizes how long it can take to recover from various types of appliance, server and grid failures. The times specified below are estimates.
| Recovery Scenario | Recovery Time |
| Restart of an appliance due to an appliance failure (crash) | appliances are restarted right away |
| Restart of appliances on another server due to a server failure | appliances are restarted within 4 minutes |
| Auto-restart of the grid controller VM due to a grid controller VM failure | within 5 minutes for the grid controller to become operational |
| Auto-restart of the grid controller VM due to the grid controller becoming unresponsive | within 10 minutes for the grid controller to become operational |
| Auto-restart of the grid controller VM due to a failure of the primary server | within 5 minutes for the grid controller to become operational |
| Restart of the grid controller when specifying a new primary server for a grid (whether the grid is operational or unavailable) | within 5 minutes for the grid controller to become operational |

It takes AppLogic approximately 3 minutes to detect that a server has gone down (either from a physical server failure or a software failure/bug).
Detection of potential Application/Grid Issues
AppLogic proactively detects various types of issues with running applications, the grid and its servers. This allows AppLogic to alert the user/grid maintainer of potential problems that may subsequently endanger the availability of running applications. This section describes the types of issues AppLogic monitors/detects and what the user/grid maintainer can do to proactively avoid unnecessary application and grid downtime.

Please see the grid dashboard messages topic for a full list of the alerts that can be posted to a grid's dashboard.
Application Recovery Issues
AppLogic detects the following problems that can cause subsequent application downtime upon the failure of one of the servers:
- insufficient grid resources; there are not enough available resources in the grid to restart one or more running appliances
- Dashboard alert:
There are not enough available resources to restart components running on n server(s) [list]
- What to do: Contact your service provider to add more servers to your grid so there are enough resources to recover from server failures.
- use of degraded volumes (volumes that are not in the OK state); the failed server contains the one and only stream for one or more of the application's volumes
- Dashboard alert:
n running applications have degraded volumes [list]
- What to do: AppLogic takes care of this by itself; it automatically repairs these volumes in the background.
- use of a degraded shared catalog volume; this may cause a massive amount of downtime since the volume is shared across all class instances
- Dashboard alert:
n catalog class(es) with shared volumes have degraded volumes [list]
- What to do: AppLogic takes care of this by itself; it automatically repairs these volumes in the background.
- all servers in the grid are disabled which will cause application downtime upon a server failure
- Dashboard alert:
HA is unavailable due to the grid having no enabled servers.
- What to do: Enable one or more of the disabled servers using the srv enable command.
Grid Controller Recovery Issues and Improper Grid Configuration
AppLogic detects the following grid controller recovery issues that could potentially cause the grid controller to become inaccessible in case of a grid controller server failure:
- grid does not have controller HA due to one or more of the grid controller servers being down
- Dashboard alert:
Grid does not have controller HA. X of Y controller servers are down. To restore controller HA, Z of the following controller servers have to be brought back online: list of servers
- What to do: Bring the specified servers back online or add new servers to the grid. Please contact your service provider for help.
- improper grid configuration for controller HA
- Dashboard alert:
The grid is not configured for controller HA; a secondary controller server needs to be assigned or else the grid cannot recover from grid controller server failures. Please assign one of the running servers as a secondary controller server in order to enable controller HA on the grid.
- What to do: No server is assigned as a secondary (backup) grid controller. Please contact your service provider immediately.
AppLogic detects the following improper grid configuration that could potentially cause grid failures or application downtime:
- single server grids do not have HA features
- Dashboard alert:
HA is unavailable due to the grid being a single server grid.
- What to do: Most of AppLogic's HA features require at least 2 servers. Please contact your service provider to add at least one more server to your grid to take advantage of the HA features described in this document.
- grid is not configured with the appropriate amount of controller memory, controller CPU or server memory
- Dashboard alert:
Grid resources are not configured correctly. This may lead to degradation in grid performance or grid instability. Please update the following grid resources on your grid or contact technical support: controller memory | controller CPU | server memory
- What to do: Contact your service provider immediately. The grid must be reconfigured to use the correct amount of resources; otherwise the grid might become unstable, which may affect the uptime of running applications.
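The configuration conditions described in this document can be summarized as a few simple checks. The sketch below is illustrative only: the function name and its parameters are hypothetical, and the way AppLogic actually evaluates or prioritizes these conditions is not documented here. The returned strings are the opening words of the corresponding dashboard alerts.

```python
def ha_configuration_alerts(server_count, enabled_count, has_secondary_controller):
    """Return the alerts a grid with this (hypothetical) state would raise."""
    alerts = []
    if server_count < 2:
        # Most HA features require at least 2 servers.
        alerts.append("HA is unavailable due to the grid being a single server grid.")
    if enabled_count == 0:
        alerts.append("HA is unavailable due to the grid having no enabled servers.")
    if not has_secondary_controller:
        # First clause of the full dashboard alert text.
        alerts.append("The grid is not configured for controller HA")
    return alerts
```

For example, a three-server grid with all servers enabled and a secondary controller assigned yields an empty list.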
Grid Issues
AppLogic is able to detect various types of grid issues that may cause application start failures or other issues:
- AppLogic failed to clean up a volume mount on one of the servers.
- Dashboard alert:
Failed to destroy mount ''. Unable to stop device . Please contact technical support.
- What to do: Contact your service provider immediately. This issue may cause application start failures.
- AppLogic failed to clean up a volume share on one of the servers.
- Dashboard alert:
Failed to unshare volume stream ''. Unable to detach volume from . Please contact technical support.
- What to do: Contact your service provider immediately. This issue may cause application start failures.
- the NTP daemon that is used to synchronize the time between all servers, appliances and the grid controller has been restarted.
- Dashboard alert:
The NTP daemon was found not to be running on the server, but has been successfully restarted.
- What to do: This is only a warning. The NTP daemon crashed or stopped working, so AppLogic restarted it. Contact your service provider.
- the NTP daemon that is used to synchronize the time between all servers and the grid controller is not running.
- Dashboard alert:
The NTP daemon was found not to be running on the server and could not be restarted. The time on the server and the time in the appliances running on the server will no longer be synchronized with the clock on the grid controller. Please contact technical support for assistance.
- What to do: Contact your service provider immediately. The clocks on the servers, appliances and grid controller may eventually drift out of sync. The NTP daemon needs to be restarted manually by a maintainer.
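The two NTP alerts above follow a check, restart, then report pattern, which can be sketched as follows. This is an assumption-laden illustration rather than AppLogic's implementation: `is_running` and `restart` are hypothetical callables supplied by the caller, and the severity labels are invented for the example. Only the first sentence of the second alert is reproduced.

```python
def check_ntp(is_running, restart):
    """Return None if the daemon is healthy, else a (severity, message) pair.

    is_running and restart are caller-supplied callables returning bool
    (hypothetical hooks, not AppLogic APIs).
    """
    if is_running():
        return None
    # Try one restart and confirm that the daemon actually came back.
    if restart() and is_running():
        return ("warning",
                "The NTP daemon was found not to be running on the server, "
                "but has been successfully restarted.")
    return ("error",
            "The NTP daemon was found not to be running on the server "
            "and could not be restarted.")
```

Passing the probes in as callables keeps the decision logic separate from however the daemon is actually inspected on a given server.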
Server Hardware Failure Detection
AppLogic detects the following hardware issues on the servers within a grid:
- the hard disk on a server is beginning to fail
- Dashboard alert:
Possible storage system failure on Server server. Error: Device: device, failure message
- What to do: The hard disk on the specified server is likely to fail and can potentially cause data loss. AppLogic automatically disables such servers so that they are not used for appliances or volumes. Contact your service provider immediately. The volumes and appliances need to be migrated off the server and its hard disk needs to be replaced. See the storage failure reference for more information.
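The "What to do" guidance for the alerts listed above falls into a few groups, which an operator script could use to triage dashboard alert text. The sketch below is a hypothetical helper, not part of AppLogic: the rule table, the function name, and the fallback for unrecognized alerts are assumptions, while the matched phrases and recommended actions are taken from the alert descriptions in this document.

```python
SELF_HEALING = "no action needed; AppLogic repairs the volumes in the background"
ENABLE_SERVER = "enable one or more disabled servers with the srv enable command"
WARNING_ONLY = "warning only; contact your service provider"
CONTACT_PROVIDER = "contact your service provider"

# (phrase found in the alert text, recommended action)
RULES = [
    ("running applications have degraded volumes", SELF_HEALING),
    ("with shared volumes have degraded volumes", SELF_HEALING),
    ("grid having no enabled servers", ENABLE_SERVER),
    ("has been successfully restarted", WARNING_ONLY),
]


def triage(alert_text):
    """Map a dashboard alert string to the action suggested in this document."""
    for phrase, action in RULES:
        if phrase in alert_text:
            return action
    # Assumption: every other alert is escalated to the service provider.
    return CONTACT_PROVIDER
```

A call such as `triage("HA is unavailable due to the grid having no enabled servers.")` returns the `ENABLE_SERVER` action; alerts that match no rule fall through to the service provider.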

In a future release, AppLogic will be able to detect and report a wider range of hardware issues on the servers (drastic temperature changes, NIC failures, etc.).
-- EricT - 16 May 2009