Secondary server in failover fails to come out of recover state

Discussion:

Oscar Ricardo Silva

2013-04-30 18:34:29 UTC

OK, I've tried running the server in debug mode but I don't see any
additional information available. This happened again today. Also, as
previously suggested, I have raised the mclt from 120 to 300.

At 11am, a configuration change was made on the primary and it was
restarted. Here are the logs from the secondary; you'll see that at
11:06:55 both servers moved to a "normal" state.

Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to shutdown
Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to partner-down
Apr 30 11:00:24 secondary-dhcp dhcpd: peer dhcp: disconnected
Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves from shutdown to recover
Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover to recover
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover to recover-done
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: I move from partner-down to normal
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover-done to normal

At 11:07:42, the secondary was restarted and these are the only entries
in the log:

Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to shutdown
Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to partner-down
Apr 30 11:07:43 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown to recover
Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to startup
Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to recover

Two hours later, the secondary server is still recovering.
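
For anyone wanting to check the same thing on their own servers, the state a server is stuck in can be pulled out of syslog mechanically. This is only a minimal Python sketch: it assumes each log entry has been rejoined onto one line (mail clients wrap them) and that the messages look like the ones quoted here. `parse_transitions` and `last_local_state` are hypothetical helper names, not part of ISC DHCP.

```python
import re

# Matches the failover transition messages quoted in this thread, e.g.
# "... dhcpd: failover peer dhcp: I move from normal to shutdown"
TRANSITION = re.compile(
    r"failover peer (?P<peer>\S+): "
    r"(?P<who>I move|peer moves) from (?P<old>[\w-]+) to (?P<new>[\w-]+)"
)

def parse_transitions(lines):
    """Return (who, old_state, new_state) for every transition line."""
    result = []
    for line in lines:
        m = TRANSITION.search(line)
        if m:
            who = "local" if m.group("who") == "I move" else "peer"
            result.append((who, m.group("old"), m.group("new")))
    return result

def last_local_state(lines):
    """Return the state this server most recently moved to, or None."""
    local = [t for t in parse_transitions(lines) if t[0] == "local"]
    return local[-1][2] if local else None

# Entries copied from the secondary's log after its 11:07:42 restart.
secondary_log = [
    "Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to shutdown",
    "Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to partner-down",
    "Apr 30 11:07:43 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown to recover",
    "Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to startup",
    "Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to recover",
]

print(last_local_state(secondary_log))
```

Lines that are not state transitions (such as "peer dhcp: disconnected") are simply skipped.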

Again, here's the strangest part of this issue: when I take down the
secondary server (dhcpd not running at all), the primary still reports
that the secondary is in recover mode. dhcpd was stopped on the
secondary at 13:07:08 and here's what the primary reports:

Apr 30 13:04:44 primary-dhcp dhcpd: peer dhcp: disconnected

$Tue Apr 30 13:14:38 CDT 2013

partner-state = 00:00:00:06
local-state = 00:00:00:04
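
The partner-state and local-state values above look like OMAPI output, and they can be decoded to state names. A small Python sketch, assuming the numbering listed for the failover-state object in the dhcpd(8) man page; treat the table as an assumption and check it against the man page for your version. `decode_state` is a hypothetical helper.

```python
# State numbers as listed for the failover-state OMAPI object in dhcpd(8).
# This table is an assumption; verify it against your own man page.
FAILOVER_STATES = {
    1: "startup",
    2: "normal",
    3: "communications-interrupted",
    4: "partner-down",
    5: "potential-conflict",
    6: "recover",
    7: "paused",
    8: "shutdown",
    9: "recover-done",
    10: "resolution-interrupted",
    11: "conflict-done",
    254: "recover-wait",
}

def decode_state(value):
    """Convert a colon-separated hex value such as '00:00:00:06' to a state name."""
    number = int(value.replace(":", ""), 16)
    return FAILOVER_STATES.get(number, f"unknown ({number})")

print(decode_state("00:00:00:06"))  # partner-state reported by the primary
print(decode_state("00:00:00:04"))  # local-state reported by the primary
```

Under that table the two values decode to recover and partner-down, which matches the description: the primary considers itself in partner-down and still believes the secondary is in recover.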

There are router ACLs on interfaces between the two servers, but the
networks on which each server resides are completely allowed without
restriction. iptables is running on each server but, again, there are no
restrictions on communication between the two. If there were a firewall
issue, the servers would never have returned to a "normal" state
after the primary was restarted.

Time is perfectly sync'ed between the two servers.

Message: 2
Date: Thu, 25 Apr 2013 00:01:45 +0100
Subject: Re: Secondary server in failover fails to come out of recover
state
Can you crank up the logging level to debug (IIRC this needs to be done via
syslog) so it details exactly what it is doing when it goes into RECOVER
state? It may give some extra pointers.

partner-state = 00:00:00:06
local-state = 00:00:00:04
partner-state = 00:00:00:04
local-state = 00:00:00:06
In following another suggestion (recreate an empty dhcpd.leases file), I
partner-state = 00:00:00:06
local-state = 00:00:00:04
subnet 192.168.75.128 netmask 255.255.255.128 {
    pool {
        range 192.168.75.130 192.168.75.254;
        deny dynamic bootp clients;
        failover peer "dhcp";
    }
    option domain-name "dept.utexas.edu";
    option subnet-mask 255.255.255.128;
    option broadcast-address 255.255.255.255;
    option routers 192.168.75.129;
}
subnet 192.168.228.32 netmask 255.255.255.224 {
    pool {
        range 192.168.228.34 192.168.228.62;
        deny dynamic bootp clients;
        failover peer "dhcp";
    }
    default-lease-time 7200;
    max-lease-time 7200;
    option domain-name "dept.utexas.edu";
    option subnet-mask 255.255.255.224;
    option broadcast-address 255.255.255.255;
    option routers 192.168.228.33;
}
The new scopes were first added to the primary, which was then reloaded.
After both servers were in a "normal" state, the corresponding change was
made on the secondary and it was reloaded.
Per Steven Carr's suggestion, I have increased the MCLT to 300 and both
servers are still in the same state.

We have two servers in a failover relationship, both running 4.1-ESV-R7.
After a reload of dhcpd on the secondary, it has not come out of the
recover state after almost an hour. We've had this happen with 3.1.3
and recently upgraded to this version. The only thing we've been able
to do is stop both instances of dhcpd and remove "my state" and "partner
state" from dhcpd.leases.
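
The workaround described above (removing the recorded failover state from dhcpd.leases) can be scripted. This is a sketch only, assuming the state is stored in `failover peer "..." state { ... }` records laid out as in the comment below; `strip_failover_state` is a hypothetical helper and the sample text is made up for illustration. It should only be used with dhcpd stopped on both servers and with a backup of the leases file.

```python
import re

# Assumed layout of the state record in dhcpd.leases:
#   failover peer "dhcp" state {
#     my state normal at 2 2013/04/30 16:06:55;
#     partner state normal at 2 2013/04/30 16:06:55;
#   }
STATE_BLOCK = re.compile(
    r'^failover peer "[^"]+" state \{\n.*?^\}\n?',
    re.DOTALL | re.MULTILINE,
)

def strip_failover_state(leases_text):
    """Return the leases text with all failover state records removed."""
    return STATE_BLOCK.sub("", leases_text)

# Illustrative sample, not taken from a real leases file.
sample = '''lease 192.168.75.130 {
  binding state free;
}
failover peer "dhcp" state {
  my state recover at 2 2013/04/30 16:08:45;
  partner state partner-down at 2 2013/04/30 16:07:42;
}
'''

print(strip_failover_state(sample))
```

The function works on text rather than on the file itself, so the result can be inspected before anything is written back.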
Here's the timeline of what happened.
1. A change was made to the configuration of the primary and dhcpd
reloaded at 15:39:14.
2. The primary moved back to a "normal" state at 15:43:42
Apr 24 15:39:14 primary-dhcp dhcpd: failover peer dhcp: I move from normal to shutdown
Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: peer moves from normal to partner-down
Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: I move from shutdown to recover
Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from recover to startup
Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from startup to recover
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer update completed.
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from recover to recover-done
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer moves from partner-down to normal
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from recover-done to normal
Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: peer moves from normal to shutdown
Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: I move from normal to partner-down
Apr 24 15:44:54 primary-dhcp dhcpd: peer dhcp: disconnected
Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from shutdown to recover
Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from recover to recover
3. The corresponding change was made on the secondary and it was
reloaded at 15:44:53
4. At 15:44:54 it came back up into recover, then moved from recover to
startup, then from startup to recover. That's where it's been ever since.
Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to shutdown
Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to partner-down
Apr 24 15:44:54 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown to recover
Apr 24 15:45:56 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to startup
Apr 24 15:45:59 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to recover
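
Because the timeline above switches between the two servers' logs, it can help to interleave them by timestamp. A minimal Python sketch, assuming unwrapped syslog lines with English month abbreviations and both logs from the same year (syslog omits the year, so 2013 is supplied here); `timestamp` and `merged_timeline` are hypothetical helpers.

```python
from datetime import datetime

def timestamp(line, year=2013):
    """Parse the leading syslog timestamp, e.g. 'Apr 24 15:39:14'."""
    # The first 15 characters hold month, day and time; the year is assumed.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def merged_timeline(*logs):
    """Interleave entries from several logs in time order (stable for ties)."""
    entries = [line for log in logs for line in log]
    return sorted(entries, key=timestamp)

# A few entries copied from the two logs in this message.
primary = [
    "Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: I move from normal to partner-down",
    "Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from shutdown to recover",
]
secondary = [
    "Apr 24 15:44:54 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown to recover",
    "Apr 24 15:45:56 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to startup",
    "Apr 24 15:45:59 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to recover",
]

for line in merged_timeline(primary, secondary):
    print(line)
```

Entries with identical timestamps keep the order in which the logs were passed in, so one-second syslog resolution cannot show which server acted first within that second.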
option domain-name-servers 192.168.50.41, 192.168.50.40;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 86400;
max-lease-time 86400;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
    deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
    primary;
    address 192.168.100.2;
    port 520;
    peer address 192.168.101.2;
    peer port 520;
    max-response-delay 60;
    max-unacked-updates 10;
    mclt 120;
    split 255;
    load balance max seconds 5;
}
subnet 192.168.100.0 netmask 255.255.255.224 {
}
include "/dhcpd/dhcpd.network.conf";
The /dhcpd/dhcpd.network.conf file holds the scope definitions. Both
servers sync time through NTP and have exactly the same time.
Any information would be appreciated.
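
Since only the primary's failover declaration is shown, one thing that can be checked mechanically on both sides is whether each declaration carries the statements appropriate to its role. The sketch below is a rough Python illustration, not a real dhcpd.conf parser: it assumes a failover block without nested braces, and it relies on my understanding of dhcpd.conf(5) that `mclt` (and, as far as I know, `split`) belong on the primary only. `parse_failover` and `check_role` are hypothetical helpers.

```python
import re

# Matches 'failover peer "name" { ... }' but not the pool-level
# 'failover peer "name";' reference, which has no braces.
FAILOVER_BLOCK = re.compile(r'failover\s+peer\s+"([^"]+)"\s*\{(.*?)\}', re.DOTALL)

def parse_failover(conf_text):
    """Return {peer_name: {statement: value}} for each failover declaration."""
    peers = {}
    for name, body in FAILOVER_BLOCK.findall(conf_text):
        settings = {}
        for statement in body.split(";"):
            words = statement.split()
            if not words:
                continue
            if len(words) == 1:
                # Bare keyword: "primary" or "secondary".
                settings["role"] = words[0]
            else:
                # Last word is the value, everything before it is the key.
                settings[" ".join(words[:-1])] = words[-1]
        peers[name] = settings
    return peers

def check_role(settings):
    """Flag statements that are assumed to be reserved for the primary."""
    problems = []
    if settings.get("role") == "secondary":
        for key in ("mclt", "split"):
            if key in settings:
                problems.append(f"'{key}' should only be set on the primary")
    if settings.get("role") == "primary" and "mclt" not in settings:
        problems.append("'mclt' is missing on the primary")
    return problems

# The primary's declaration as posted in this message.
primary_conf = '''
failover peer "dhcp" {
    primary;
    address 192.168.100.2;
    port 520;
    peer address 192.168.101.2;
    peer port 520;
    max-response-delay 60;
    max-unacked-updates 10;
    mclt 120;
    split 255;
    load balance max seconds 5;
}
'''

settings = parse_failover(primary_conf)["dhcp"]
print(settings["role"], settings["peer address"], check_role(settings))
```

Running the same check against the secondary's configuration would show whether an `mclt` or `split` statement was copied over along with the scope changes.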

_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users

End of dhcp-users Digest, Vol 54, Issue 21
******************************************

Steven Carr

2013-04-30 18:58:06 UTC