Multiple Watchdog handler



Recently, I've been having a problem with kernel panics on kernels after 6.3.7, which cause a hard hang of the system.

So, the first thing to do was set up a watchdog to reset the system after 60 seconds with nothing feeding it. That way, the system resets itself and I don't need to manually reboot it each time.

The problem is, the default watchdog daemon can only handle a single watchdog - and I want to activate two.

Sounds like time for another simple perl script!

#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(nice);
use IO::Handle;    # provides the autoflush() method used below
$|++;

## Match the numbered devices (/dev/watchdog0, /dev/watchdog1, ...).
my @watchdogs = glob("/dev/watchdog?");

## Find the lowest timeout...
print "Finding lowest watchdog timeout...\n";
my $sleep_timeout = 60;
my @wd_timeouts = glob("/sys/class/watchdog/*/timeout");
for my $wd_timeout ( @wd_timeouts ) {
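    open my $fh, '<', $wd_timeout or die "Cannot read $wd_timeout: $!\n";
    my $timeout = do { local $/; <$fh> };
    close $fh;
    chomp $timeout;    # strip the trailing newline from the sysfs value
    print "Timeout $wd_timeout = $timeout\n";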
    if ( $timeout < $sleep_timeout ) {
        $sleep_timeout = $timeout;
    }
}

## Halve the timeout so each watchdog is fed twice per timeout period
$sleep_timeout = $sleep_timeout / 2;
print "Using final timeout of $sleep_timeout\n";

nice(-19);
$SIG{INT}  = \&signal_handler;
$SIG{TERM} = \&signal_handler;

## Open the file handles...
my @fhs;
for my $watchdog ( @watchdogs ) {
    print "Opening: $watchdog\n";
    open(my $fh, ">", $watchdog) or die "Cannot open $watchdog: $!\n";
    $fh->autoflush(1);
    my $device = {
        device  => $watchdog,
        fh  => $fh,
    };
    push @fhs, $device;
}

## Start feeding the watchdogs.
while (1) {
    for my $watchdog ( @fhs ) {
        #print "Feeding: " . $watchdog->{"device"} . "\n";
        my $fh = $watchdog->{"fh"};
        print $fh ".\n";
    }
    #print "Sleeping $sleep_timeout seconds...\n";
    sleep $sleep_timeout;
}

sub signal_handler {
    for my $watchdog ( @fhs ) {
        print "Sending STOP to " . $watchdog->{"device"} . "\n";
        my $fh = $watchdog->{"fh"};
        ## 'V' is the magic close character; it must be the last write before close.
        print $fh "V";
        close $fh;
    }
    exit 0;
}

This script scans for the lowest timeout across all watchdogs installed in the system, and then feeds every watchdog at an interval of half that timeout.
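
To make the timing rule concrete, here is a minimal sketch of the interval calculation on its own. The helper name feed_interval is hypothetical and not part of the script above. Rounding down to whole seconds and the floor of one second are my assumptions too: the script simply divides by two.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: given a default and the timeouts (in seconds)
# reported by each watchdog, return how often to feed all of them.
sub feed_interval {
    my ($default, @timeouts) = @_;
    my $lowest   = min($default, @timeouts);   # the strictest watchdog wins
    my $interval = int($lowest / 2);           # sleep() works in whole seconds
    return $interval < 1 ? 1 : $interval;      # assumption: never sleep 0 seconds
}

# Example: one watchdog set to 60 seconds and another set to 30 seconds.
my $interval = feed_interval(60, 60, 30);
print "Feed every $interval seconds\n";
```

With a 60 second and a 30 second watchdog, both get fed on the 30 second watchdog's schedule, which is more often than the 60 second one strictly needs but keeps the loop simple.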

It can be started with a simple systemd unit:

[Unit]
Description=Run watchdog feeder

[Service]
Type=simple
ExecStart=/root/bin/watchdog.pl
Restart=always
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=99

[Install]
WantedBy=multi-user.target

When the service is stopped, systemd sends SIGTERM and the script writes the magic close character (V) to each watchdog, so a stopped service won't trigger a system reset.
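
The stop sequence can be sketched in isolation like this. It is only an illustration: the helper name stop_watchdog is hypothetical, and an in-memory file handle stands in for a real /dev/watchdogN device so it can be run safely on any machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical helper: write the magic close character and close the handle.
# Returns true only if both the write and the close succeeded.
sub stop_watchdog {
    my ($fh) = @_;
    my $ok = print {$fh} "V";    # 'V' must be the last thing written
    return close($fh) && $ok;
}

# Stand-in for a watchdog device: an in-memory handle backed by $buffer.
my $buffer = '';
open(my $fake, '>', \$buffer) or die "Cannot open in-memory handle: $!\n";
$fake->autoflush(1);

print {$fake} ".\n";             # a normal keepalive write
stop_watchdog($fake) or die "Stop sequence failed\n";
print "Bytes written: ", length($buffer), "\n";
```

On a real device this only disarms the watchdog if the driver supports magic close and was not loaded with the nowayout option.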

Nice and simple.
