System Health Monitor

July 26, 2025 at 10:43 AM

The SystemHealthMonitor agent is a configurable service that actively monitors the host system's core resources, including CPU, memory, and disk usage. It is designed to periodically check these resources and send a status report if any of them exceed predefined thresholds.

⚙️ How it Works

The agent operates in a continuous loop, pausing for a configurable interval between each check.

Check Resources: In each cycle, it uses the psutil library to get the current CPU load, virtual memory usage, and root disk space usage.

Compare to Thresholds: It compares each resource's usage percentage against the cpu_threshold_percent, mem_threshold_percent, and disk_threshold_percent values set in its configuration.

Send Report: If any resource exceeds its threshold, the agent constructs and sends a WARNING severity status report. This report is sent as a standard command packet to all agents that have the configured report_to_role. The packet is directed to the handler specified by report_handler on the receiving agent(s).
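The comparison step can be sketched in isolation, without psutil or the swarm runtime. This is a minimal illustration, not the agent's own code: `find_breaches`, `DEFAULT_THRESHOLDS`, and the sample readings are hypothetical names and values introduced here. The strict `>` comparison mirrors the checks in the agent's source, so a reading exactly equal to its threshold does not trigger a report.

```python
from typing import Dict, List, Tuple

# Threshold defaults as listed in this page's configuration section.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "memory": 95.0,
    "cpu": 90.0,
    "disk": 95.0,
}


def find_breaches(
    usage: Dict[str, float],
    thresholds: Dict[str, float] = DEFAULT_THRESHOLDS,
) -> List[Tuple[str, float, float]]:
    """Return (resource, usage, threshold) for each resource strictly above its threshold."""
    breaches = []
    for resource, limit in thresholds.items():
        value = usage.get(resource)
        # Strict comparison: usage equal to the threshold is not a breach.
        if value is not None and value > limit:
            breaches.append((resource, value, limit))
    return breaches


# Hypothetical readings; the real agent obtains these from psutil.
sample = {"memory": 97.2, "cpu": 90.0, "disk": 41.5}
for resource, value, limit in find_breaches(sample):
    print(f"WARNING: {resource} at {value:.2f}% exceeds {limit:.2f}%")
```

With the sample readings above, only memory is reported: CPU sits exactly at its threshold and disk is well below it.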

🧩 Configuration

The agent is driven by its config block in the boot directive. All monitoring thresholds and reporting settings can be customized.

Configurable Options:

check_interval_sec (Default: 60): The time in seconds between each resource check.

mem_threshold_percent (Default: 95.0): The memory usage percentage that triggers a report.

cpu_threshold_percent (Default: 90.0): The CPU load percentage that triggers a report.

disk_threshold_percent (Default: 95.0): The disk usage percentage that triggers a report.

report_to_role (Default: "hive.forensics.data_feed"): The role that status reports will be sent to.

report_handler (Default: "cmd_ingest_status_report"): The command handler that the receiving agent should use to process the report.
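The way these defaults combine with a directive's config block can be sketched as follows. `DEFAULTS` and `resolve_config` are hypothetical names used only for illustration; the agent itself reads each key with `config.get(key, default)` in its constructor, and this sketch reproduces that behavior, including ignoring keys it does not recognize.

```python
from typing import Any, Dict, Optional

# Defaults as documented in the option list above.
DEFAULTS: Dict[str, Any] = {
    "check_interval_sec": 60,
    "mem_threshold_percent": 95.0,
    "cpu_threshold_percent": 90.0,
    "disk_threshold_percent": 95.0,
    "report_to_role": "hive.forensics.data_feed",
    "report_handler": "cmd_ingest_status_report",
}


def resolve_config(config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Overlay a directive's config block on the defaults, key by key."""
    overrides = config or {}
    # Only known keys are read, mirroring per-key config.get(key, default).
    return {key: overrides.get(key, default) for key, default in DEFAULTS.items()}


# A partial config block: unspecified options keep their defaults.
resolved = resolve_config({"check_interval_sec": 30, "cpu_threshold_percent": 80.0})
for key, value in resolved.items():
    print(f"{key} = {value!r}")
```

Because every option has a fallback, a directive only needs to list the values it wants to change.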

🧭 Directive

matrix_directive = {
    "universal_id": "matrix",
    "name": "matrix",
    "children": [
        {
            "universal_id": "forensic-alpha-1",
            "name": "forensic_detective"
            # This agent will receive the reports by default
        },
        {
            "universal_id": "sys-health-1",
            "name": "system_health",
            "config": {
                "check_interval_sec": 30,
                "mem_threshold_percent": 85.0,
                "cpu_threshold_percent": 80.0
            }
        }
    ]
}
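Note that the directive above overrides three options and leaves the rest at their defaults. The sketch below makes that visible by walking a trimmed copy of the directive and reading the health monitor's config. `find_node` and `example_directive` are hypothetical helpers written for this illustration and are not part of the MatrixSwarm API; inside the swarm, the agent receives its own node through `self.tree_node`.

```python
from typing import Any, Dict, Optional


def find_node(node: Dict[str, Any], universal_id: str) -> Optional[Dict[str, Any]]:
    """Depth-first search of a directive tree for a node with the given universal_id."""
    if node.get("universal_id") == universal_id:
        return node
    for child in node.get("children", []):
        found = find_node(child, universal_id)
        if found is not None:
            return found
    return None


# Trimmed copy of the directive shown above.
example_directive = {
    "universal_id": "matrix",
    "name": "matrix",
    "children": [
        {"universal_id": "forensic-alpha-1", "name": "forensic_detective"},
        {
            "universal_id": "sys-health-1",
            "name": "system_health",
            "config": {
                "check_interval_sec": 30,
                "mem_threshold_percent": 85.0,
                "cpu_threshold_percent": 80.0,
            },
        },
    ],
}

node = find_node(example_directive, "sys-health-1")
config = node.get("config", {}) if node else {}

# disk_threshold_percent is absent from the directive, so the default applies.
print("disk threshold:", config.get("disk_threshold_percent", 95.0))
print("report role:", config.get("report_to_role", "hive.forensics.data_feed"))
```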

📦 Source

#Authored by Daniel F MacDonald and Gemini
import sys
import os
import psutil

sys.path.insert(0, os.getenv("SITE_ROOT"))
sys.path.insert(0, os.getenv("AGENT_PATH"))

from matrixswarm.core.boot_agent import BootAgent
from matrixswarm.core.utils.swarm_sleep import interruptible_sleep
from matrixswarm.core.class_lib.packet_delivery.utility.encryption.utility.identity import IdentityObject

class Agent(BootAgent):
    """
    A config-driven MatrixSwarm agent that monitors system resources.
    It sends reports to the role defined in its configuration.
    """
    def __init__(self):
        """
        Initializes the agent and loads its configuration directly from the
        directive's tree_node, following the swarm's standard pattern.
        """
        super().__init__()
        self.name = "SystemHealthMonitor"

        # Get the agent's specific config dictionary from the global tree_node.
        config = self.tree_node.get("config", {})

        self.log("Initializing SystemHealthMonitor from directive config...")

        # Set attributes, using config values but keeping original defaults as fallbacks.
        self.mem_threshold = config.get("mem_threshold_percent", 95.0)
        self.cpu_threshold = config.get("cpu_threshold_percent", 90.0)
        self.disk_threshold = config.get("disk_threshold_percent", 95.0)
        self.check_interval_sec = config.get("check_interval_sec", 60)
        self.report_to_role = config.get("report_to_role", "hive.forensics.data_feed")
        self.report_handler = config.get("report_handler", "cmd_ingest_status_report")

        self.log(f"Monitoring configured: [Mem: {self.mem_threshold}%, CPU: {self.cpu_threshold}%, Disk: {self.disk_threshold}%]")
        self.log(f"Reporting to role '{self.report_to_role}' with handler '{self.report_handler}'")

    def send_status_report(self, service_name, status, severity, details):
        """Helper method to construct and send a status packet to the configured role."""
        pk_content = {
            "handler": self.report_handler,
            "content": {
                "source_agent": self.name,
                "service_name": service_name,
                "status": status,
                "details": details,
                "severity": severity,
            },
        }
        # Get destination nodes from the role defined in the config
        report_nodes = self.get_nodes_by_role(self.report_to_role)
        if not report_nodes:
            # No agent currently holds the target role, so there is nothing to send.
            return

        pk = self.get_delivery_packet("standard.command.packet")
        pk.set_data(pk_content)
        for node in report_nodes:
            self.pass_packet(pk, node["universal_id"])

        if self.debug.is_enabled():
            self.log(f"Sent '{severity}' for '{service_name}' to role '{self.report_to_role}'", level="INFO")

    def worker(self, config: dict = None, identity: IdentityObject = None):
        """Runs one monitoring cycle (memory, CPU, disk), then sleeps for check_interval_sec."""
        try:
            # Check Memory
            mem = psutil.virtual_memory()
            if mem.percent > self.mem_threshold:
                self.send_status_report("system.memory", "high_usage", "WARNING",
                                        f"Memory usage is critical: {mem.percent:.2f}%.")

            # Check CPU
            cpu = psutil.cpu_percent(interval=1)
            if cpu > self.cpu_threshold:
                self.send_status_report("system.cpu", "high_load", "WARNING", f"CPU load is critical: {cpu:.2f}%.")

            # Check Disk
            disk = psutil.disk_usage('/')
            if disk.percent > self.disk_threshold:
                self.send_status_report("system.disk", "low_space", "WARNING",
                                        f"Root disk space is critical: {disk.percent:.2f}% full.")
        except Exception as e:
            self.log(f"An error occurred while checking system resources: {e}", level="ERROR")

        interruptible_sleep(self, self.check_interval_sec)

if __name__ == "__main__":
    agent = Agent()
    agent.boot()
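For anyone writing the receiving handler, the shape of the payload that send_status_report passes to the packet can be sketched as a plain dictionary. `build_report` is a hypothetical standalone function that mirrors the `pk_content` structure in the source above; the memory reading is made up, and the packet wrapping and delivery done by the swarm are not reproduced here.

```python
import json
from typing import Any, Dict


def build_report(handler: str, source_agent: str, service_name: str,
                 status: str, severity: str, details: str) -> Dict[str, Any]:
    """Build the same dictionary layout the agent assigns to pk_content."""
    return {
        "handler": handler,
        "content": {
            "source_agent": source_agent,
            "service_name": service_name,
            "status": status,
            "details": details,
            "severity": severity,
        },
    }


# Example using the default handler and a hypothetical memory reading of 97.2%.
report = build_report(
    handler="cmd_ingest_status_report",
    source_agent="SystemHealthMonitor",
    service_name="system.memory",
    status="high_usage",
    severity="WARNING",
    details=f"Memory usage is critical: {97.2:.2f}%.",
)
print(json.dumps(report, indent=2))
```

A handler on the receiving agent would typically branch on `service_name` and `severity` inside `content`.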

Category: monitoring

Tags: #health-check, #system-health, #devops, #resource-monitoring, #cpu-monitor, #memory-monitor, #disk-monitor, #psutil, #threshold-alerting, #status-report

Version: v1.0.0

Author: matrixswarm

Added: July 26, 2025

Updated: July 26, 2025