Discussion:
Secondary server in failover fails to come out of recover state
Oscar Ricardo Silva
2013-04-30 18:34:29 UTC
Permalink
OK, I've tried running the server in debug mode but I don't see any
additional information available. This happened again today. Also, as
previously suggested, I have raised the mclt from 120 to 300.


At 11am, a configuration change was made on the primary and it was
restarted. Here's the logs from the secondary and you'll see that at
11:06:55 both servers moved to a "normal" state.

Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from normal to shutdown
Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to partner-down
Apr 30 11:00:24 secondary-dhcp dhcpd: peer dhcp: disconnected
Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from shutdown to recover
Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from recover to recover
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from recover to recover-done
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: I move from
partner-down to normal
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from recover-done to normal



At 11:07:42, the secondary was restarted and these are the only entries
in the log:

Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from normal to partner-down
Apr 30 11:07:43 secondary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover

two hours later, the secondary server is still recovering.



Again, here's the strangest part of this issue: when I take down the
secondary server (dhcpd not running at all), the primary still reports
that the secondary is in recover mode. dhcpd was stopped on the
secondary at 13:07:08 and here's what the primary reports:

Apr 30 13:04:44 primary-dhcp dhcpd: peer dhcp: disconnected


$Tue Apr 30 13:14:38 CDT 2013

partner-state = 00:00:00:06
local-state = 00:00:00:04



There are router acls on interfaces between the two servers but the
networks on which each server resides is completely allowed without
restriction. iptables is running on each server but again, no
restrictions on communications between the two. If there was a firewall
issue then the servers would never have returned to a "normal" state
after the primary was restarted.

Time is perfectly sync'ed between the two servers.
Message: 2
Date: Thu, 25 Apr 2013 00:01:45 +0100
Subject: Re: Secondary server in failover fails to come out of recover
state
Content-Type: text/plain; charset="utf-8"
Can you crank up the logging level to debug (IIRC this needs to be done via
syslog) so it details exactly what it is doing when it goes into RECOVER
state, it may give some extra pointers.
partner-state = 00:00:00:06
local-state = 00:00:00:04
partner-state = 00:00:00:04
local-state = 00:00:00:06
In following another suggestion (recreate an empty dhcpd.leases file), I
partner-state = 00:00:00:06
local-state = 00:00:00:04
subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option domain-name "dept.utexas.edu";
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}
subnet 192.168.228.32 netmask 255.255.255.224 {
pool {
range 192.168.228.34 192.168.228.62;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
default-lease-time 7200;
max-lease-time 7200;
option domain-name "dept.utexas.edu";
option subnet-mask 255.255.255.224;
option broadcast-address 255.255.255.255;
option routers 192.168.228.33;
}
the new scopes were first added to the primary, it was then reloaded.
After both servers were in a "normal" state, the corresponding change was
made on the secondary and it was reloaded.
Per Stephen Carr's suggestion, I have increased the MCLT to 300 and both
servers are still in the same state.
We have two servers in a failover relationship, both running 4.1-ESV-R7.
After a reload of dhcpd on the secondary, it has not come out of the
recover state after almost an hour. We've had this happen with 3.1.3
and recently upgraded to this version. The only thing we've been able
to do is stop both instances of dhcpd and remove "my state" and "partner
state" from dhcpd.leases.
Here's the timeline of what happened.
1. A change was made to the configuration of the primary and dhcpd
reloaded at 15:39:14.
2. The primary moved back to a "normal" state at 15:43:42
Apr 24 15:39:14 primary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to partner-down
Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer update
completed.
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from
recover to recover-done
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer moves from
partner-down to normal
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from
recover-done to normal
Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to shutdown
Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: I move from
normal to partner-down
Apr 24 15:44:54 primary-dhcp dhcpd: peer dhcp: disconnected
Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from
shutdown to recover
Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from
recover to recover
3. The corresponding change was made on the secondary and it was
reloaded at 15:44:53
4. At 15:44:54 it came back up into recover, then moved from recover to
startup, then from startup to recover. That's where it's been ever since.
Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from normal to partner-down
Apr 24 15:44:54 secondary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 24 15:45:56 secondary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 24 15:45:59 secondary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover
option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 86400;
max-lease-time 86400;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
primary;
address 192.168.100.2;
port 520;
peer address 192.168.101.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
mclt 120;
split 255;
load balance max seconds 5;
}
subnet 192.168.100.0 netmask 255.255.255.224 {
}
include "/dhcpd/dhcpd.network.conf";
and the /dhcpd/dhcpd.network.conf file holds the scope definitions. Both
servers sync time through ntp and have the same exact time.
Any information would be appreciated.
______________________________**_________________
dhcp-users mailing list
https://lists.isc.org/mailman/**listinfo/dhcp-users<https://lists.isc.org/mailman/listinfo/dhcp-users>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.isc.org/pipermail/dhcp-users/attachments/20130425/c084fffc/attachment-0001.html>
------------------------------
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
End of dhcp-users Digest, Vol 54, Issue 21
******************************************
Steven Carr
2013-04-30 18:58:06 UTC
Permalink
Can you post the two full configs somewhere (or as near full as you can
without removing the main bulk of the config)? or feel free to email them
to me directly. Also, so I can try to reproduce in our lab what OS are you
running?
Post by Oscar Ricardo Silva
OK, I've tried running the server in debug mode but I don't see any
additional information available. This happened again today. Also, as
previously suggested, I have raised the mclt from 120 to 300.
At 11am, a configuration change was made on the primary and it was
restarted. Here's the logs from the secondary and you'll see that at
11:06:55 both servers moved to a "normal" state.
Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to shutdown
Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to partner-down
Apr 30 11:00:24 secondary-dhcp dhcpd: peer dhcp: disconnected
Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
shutdown to recover
Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover to recover
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover to recover-done
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: I move from
partner-down to normal
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover-done to normal
At 11:07:42, the secondary was restarted and these are the only entries in
Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to partner-down
Apr 30 11:07:43 secondary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover
two hours later, the secondary server is still recovering.
Again, here's the strangest part of this issue: when I take down the
secondary server (dhcpd not running at all), the primary still reports that
the secondary is in recover mode. dhcpd was stopped on the secondary at
Apr 30 13:04:44 primary-dhcp dhcpd: peer dhcp: disconnected
$Tue Apr 30 13:14:38 CDT 2013
partner-state = 00:00:00:06
local-state = 00:00:00:04
There are router acls on interfaces between the two servers but the
networks on which each server resides is completely allowed without
restriction. iptables is running on each server but again, no restrictions
on communications between the two. If there was a firewall issue then the
servers would never have returned to a "normal" state after the primary was
restarted.
Time is perfectly sync'ed between the two servers.
Message: 2
Date: Thu, 25 Apr 2013 00:01:45 +0100
Subject: Re: Secondary server in failover fails to come out of recover
state
Content-Type: text/plain; charset="utf-8"
Can you crank up the logging level to debug (IIRC this needs to be done via
syslog) so it details exactly what it is doing when it goes into RECOVER
state, it may give some extra pointers.
Post by Oscar Ricardo Silva
partner-state = 00:00:00:06
local-state = 00:00:00:04
partner-state = 00:00:00:04
local-state = 00:00:00:06
In following another suggestion (recreate an empty dhcpd.leases file), I
partner-state = 00:00:00:06
local-state = 00:00:00:04
subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option domain-name "dept.utexas.edu";
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}
subnet 192.168.228.32 netmask 255.255.255.224 {
pool {
range 192.168.228.34 192.168.228.62;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
default-lease-time 7200;
max-lease-time 7200;
option domain-name "dept.utexas.edu";
option subnet-mask 255.255.255.224;
option broadcast-address 255.255.255.255;
option routers 192.168.228.33;
}
the new scopes were first added to the primary, it was then reloaded.
After both servers were in a "normal" state, the corresponding change was
made on the secondary and it was reloaded.
Per Stephen Carr's suggestion, I have increased the MCLT to 300 and both
servers are still in the same state.
We have two servers in a failover relationship, both running 4.1-ESV-R7.
After a reload of dhcpd on the secondary, it has not come out of the
recover state after almost an hour. We've had this happen with 3.1.3
and recently upgraded to this version. The only thing we've been able
to do is stop both instances of dhcpd and remove "my state" and "partner
state" from dhcpd.leases.
Here's the timeline of what happened.
1. A change was made to the configuration of the primary and dhcpd
reloaded at 15:39:14.
2. The primary moved back to a "normal" state at 15:43:42
Apr 24 15:39:14 primary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to partner-down
Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer update
completed.
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from
recover to recover-done
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer moves from
partner-down to normal
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from
recover-done to normal
Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to shutdown
Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: I move from
normal to partner-down
Apr 24 15:44:54 primary-dhcp dhcpd: peer dhcp: disconnected
Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from
shutdown to recover
Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from
recover to recover
3. The corresponding change was made on the secondary and it was
reloaded at 15:44:53
4. At 15:44:54 it came back up into recover, then moved from recover to
startup, then from startup to recover. That's where it's been ever since.
Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from normal to partner-down
Apr 24 15:44:54 secondary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 24 15:45:56 secondary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 24 15:45:59 secondary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover
option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 86400;
max-lease-time 86400;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
primary;
address 192.168.100.2;
port 520;
peer address 192.168.101.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
mclt 120;
split 255;
load balance max seconds 5;
}
subnet 192.168.100.0 netmask 255.255.255.224 {
}
include "/dhcpd/dhcpd.network.conf";
and the /dhcpd/dhcpd.network.conf file holds the scope definitions. Both
servers sync time through ntp and have the same exact time.
Any information would be appreciated.
______________________________****_________________
dhcp-users mailing list
https://lists.isc.org/mailman/****listinfo/dhcp-users<https://lists.isc.org/mailman/**listinfo/dhcp-users>
<https://**lists.isc.org/mailman/**listinfo/dhcp-users<https://lists.isc.org/mailman/listinfo/dhcp-users>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.isc.org/**pipermail/dhcp-users/**
attachments/20130425/c084fffc/**attachment-0001.html<https://lists.isc.org/pipermail/dhcp-users/attachments/20130425/c084fffc/attachment-0001.html>
------------------------------
______________________________**_________________
dhcp-users mailing list
https://lists.isc.org/mailman/**listinfo/dhcp-users<https://lists.isc.org/mailman/listinfo/dhcp-users>
End of dhcp-users Digest, Vol 54, Issue 21
********************************************
______________________________**_________________
dhcp-users mailing list
https://lists.isc.org/mailman/**listinfo/dhcp-users<https://lists.isc.org/mailman/listinfo/dhcp-users>
Oscar Ricardo Silva
2013-04-30 19:41:35 UTC
Permalink
I should mention that we've written a patch to use the MAC address as
the identifier instead of the client identifier. We've done this so
that a device will have the same identifier no matter what operating
system it boots into. This is an issue for multi-boot devices or if a
devices boots into a PXE-boot environment and that's why you'll a this
line in the configuration:

key-off-mac-address true;



Operating system versions:

primary: RHEL 6.2, kernel 2.6.32-220.7.1.el6.i686
secondary: RHEL 6.3, kernel 2.6.32-279.19.1.el6.i686





Primary server:


option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
primary;
address 192.168.100.2;
port 520;
peer address 192.168.101.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
mclt 300;
split 255;
load balance max seconds 5;
}
subnet 192.168.100.0 netmask 255.255.255.224 {
}




Secondary:

option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
secondary;
address 192.168.101.2;
port 520;
peer address 192.168.100.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
load balance max seconds 5;
}
subnet 192.168.101.0 netmask 255.255.255.224 {
}
Date: Tue, 30 Apr 2013 19:58:06 +0100
Subject: Re: Secondary server in failover fails to come out of recover
state
Content-Type: text/plain; charset="utf-8"
Can you post the two full configs somewhere (or as near full as you can
without removing the main bulk of the config)? or feel free to email them
to me directly. Also, so I can try to reproduce in our lab what OS
are you
running?
OK, I've tried running the server in debug mode but I don't see any
additional information available. This happened again today. Also, as
previously suggested, I have raised the mclt from 120 to 300.
At 11am, a configuration change was made on the primary and it was
restarted. Here's the logs from the secondary and you'll see that at
11:06:55 both servers moved to a "normal" state.
Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from normal to shutdown
Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to partner-down
Apr 30 11:00:24 secondary-dhcp dhcpd: peer dhcp: disconnected
Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from shutdown to recover
Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from recover to recover
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from recover to recover-done
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: I move from
partner-down to normal
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from recover-done to normal
At 11:07:42, the secondary was restarted and these are the only entries
Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from normal to partner-down
Apr 30 11:07:43 secondary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover
two hours later, the secondary server is still recovering.
Again, here's the strangest part of this issue: when I take down the
secondary server (dhcpd not running at all), the primary still reports
that the secondary is in recover mode. dhcpd was stopped on the
Apr 30 13:04:44 primary-dhcp dhcpd: peer dhcp: disconnected
$Tue Apr 30 13:14:38 CDT 2013
partner-state = 00:00:00:06
local-state = 00:00:00:04
There are router acls on interfaces between the two servers but the
networks on which each server resides is completely allowed without
restriction. iptables is running on each server but again, no
restrictions on communications between the two. If there was a firewall
issue then the servers would never have returned to a "normal" state
after the primary was restarted.
Time is perfectly sync'ed between the two servers.
Message: 2
Date: Thu, 25 Apr 2013 00:01:45 +0100
Subject: Re: Secondary server in failover fails to come out of recover
state
Content-Type: text/plain; charset="utf-8"
Can you crank up the logging level to debug (IIRC this needs to be done via
syslog) so it details exactly what it is doing when it goes into RECOVER
state, it may give some extra pointers.
Post by Oscar Ricardo Silva
partner-state = 00:00:00:06
local-state = 00:00:00:04
partner-state = 00:00:00:04
local-state = 00:00:00:06
In following another suggestion (recreate an empty dhcpd.leases file), I
partner-state = 00:00:00:06
local-state = 00:00:00:04
subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option domain-name "dept.utexas.edu";
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}
subnet 192.168.228.32 netmask 255.255.255.224 {
pool {
range 192.168.228.34 192.168.228.62;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
default-lease-time 7200;
max-lease-time 7200;
option domain-name "dept.utexas.edu";
option subnet-mask 255.255.255.224;
option broadcast-address 255.255.255.255;
option routers 192.168.228.33;
}
the new scopes were first added to the primary, it was then reloaded.
After both servers were in a "normal" state, the corresponding change was
made on the secondary and it was reloaded.
Per Stephen Carr's suggestion, I have increased the MCLT to 300 and both
servers are still in the same state.
We have two servers in a failover relationship, both running 4.1-ESV-R7.
After a reload of dhcpd on the secondary, it has not come out of the
recover state after almost an hour. We've had this happen with 3.1.3
and recently upgraded to this version. The only thing we've been able
to do is stop both instances of dhcpd and remove "my state" and "partner
state" from dhcpd.leases.
Here's the timeline of what happened.
1. A change was made to the configuration of the primary and dhcpd
reloaded at 15:39:14.
2. The primary moved back to a "normal" state at 15:43:42
Apr 24 15:39:14 primary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to partner-down
Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer update
completed.
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from
recover to recover-done
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer moves from
partner-down to normal
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from
recover-done to normal
Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to shutdown
Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: I move from
normal to partner-down
Apr 24 15:44:54 primary-dhcp dhcpd: peer dhcp: disconnected
Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from
shutdown to recover
Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from
recover to recover
3. The corresponding change was made on the secondary and it was
reloaded at 15:44:53
4. At 15:44:54 it came back up into recover, then moved from recover to
startup, then from startup to recover. That's where it's been ever since.
Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from normal to partner-down
Apr 24 15:44:54 secondary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 24 15:45:56 secondary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 24 15:45:59 secondary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover
option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 86400;
max-lease-time 86400;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
primary;
address 192.168.100.2;
port 520;
peer address 192.168.101.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
mclt 120;
split 255;
load balance max seconds 5;
}
subnet 192.168.100.0 netmask 255.255.255.224 {
}
include "/dhcpd/dhcpd.network.conf";
and the /dhcpd/dhcpd.network.conf file holds the scope definitions. Both
servers sync time through ntp and have the same exact time.
Any information would be appreciated.
______________________________**_________________
dhcp-users mailing list
https://lists.isc.org/mailman/**listinfo/dhcp-users<https://lists.isc.org/mailman/listinfo/dhcp-users>
-------------- next part --------------
An HTML attachment was scrubbed...
<https://lists.isc.org/pipermail/dhcp-users/attachments/20130425/c084fffc/attachment-0001.html>
------------------------------
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
End of dhcp-users Digest, Vol 54, Issue 21
******************************************
Oscar Ricardo Silva
2013-04-30 19:50:07 UTC
Permalink
I should add that the only way we can fix this is to shutdown both
servers, remove the "my state" and "partner state" lines from this
section of dhcpd.leases :

failover peer "dhcp" state {
my state normal at 2 2013/04/30 18:39:28;
partner state normal at 0 2008/01/06 16:56:50;
}


and the start dhcpd on both. There doesn't appear to be any pattern to
when this happens. We can go for days or weeks with no issues, both
servers will sync up after a change but then after adding a new scope,
this issue may occur.
Post by Oscar Ricardo Silva
I should mention that we've written a patch to use the MAC address as
the identifier instead of the client identifier. We've done this so
that a device will have the same identifier no matter what operating
system it boots into. This is an issue for multi-boot devices or if a
devices boots into a PXE-boot environment and that's why you'll a this
key-off-mac-address true;
primary: RHEL 6.2, kernel 2.6.32-220.7.1.el6.i686
secondary: RHEL 6.3, kernel 2.6.32-279.19.1.el6.i686
option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
primary;
address 192.168.100.2;
port 520;
peer address 192.168.101.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
mclt 300;
split 255;
load balance max seconds 5;
}
subnet 192.168.100.0 netmask 255.255.255.224 {
}
option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
secondary;
address 192.168.101.2;
port 520;
peer address 192.168.100.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
load balance max seconds 5;
}
subnet 192.168.101.0 netmask 255.255.255.224 {
}
Date: Tue, 30 Apr 2013 19:58:06 +0100
Subject: Re: Secondary server in failover fails to come out of recover
state
Content-Type: text/plain; charset="utf-8"
Can you post the two full configs somewhere (or as near full as you can
without removing the main bulk of the config)? or feel free to email
them
to me directly. Also, so I can try to reproduce in our lab what OS
are you
running?
OK, I've tried running the server in debug mode but I don't see any
additional information available. This happened again today. Also, as
previously suggested, I have raised the mclt from 120 to 300.
At 11am, a configuration change was made on the primary and it was
restarted. Here's the logs from the secondary and you'll see that at
11:06:55 both servers moved to a "normal" state.
Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from normal to shutdown
Apr 30 11:00:23 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to partner-down
Apr 30 11:00:24 secondary-dhcp dhcpd: peer dhcp: disconnected
Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from shutdown to recover
Apr 30 11:03:36 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from recover to recover
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from recover to recover-done
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: I move from
partner-down to normal
Apr 30 11:06:55 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from recover-done to normal
At 11:07:42, the secondary was restarted and these are the only entries
Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 30 11:07:42 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from normal to partner-down
Apr 30 11:07:43 secondary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 30 11:08:45 secondary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover
two hours later, the secondary server is still recovering.
Again, here's the strangest part of this issue: when I take down the
secondary server (dhcpd not running at all), the primary still reports
that the secondary is in recover mode. dhcpd was stopped on the
Apr 30 13:04:44 primary-dhcp dhcpd: peer dhcp: disconnected
$Tue Apr 30 13:14:38 CDT 2013
partner-state = 00:00:00:06
local-state = 00:00:00:04
There are router acls on interfaces between the two servers but the
networks on which each server resides is completely allowed without
restriction. iptables is running on each server but again, no
restrictions on communications between the two. If there was a firewall
issue then the servers would never have returned to a "normal" state
after the primary was restarted.
Time is perfectly sync'ed between the two servers.
Message: 2
Date: Thu, 25 Apr 2013 00:01:45 +0100
Subject: Re: Secondary server in failover fails to come out of recover
state
Content-Type: text/plain; charset="utf-8"
Can you crank up the logging level to debug (IIRC this needs to be done via
syslog) so it details exactly what it is doing when it goes into RECOVER
state, it may give some extra pointers.
Post by Oscar Ricardo Silva
partner-state = 00:00:00:06
local-state = 00:00:00:04
partner-state = 00:00:00:04
local-state = 00:00:00:06
In following another suggestion (recreate an empty dhcpd.leases file), I
partner-state = 00:00:00:06
local-state = 00:00:00:04
subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option domain-name "dept.utexas.edu";
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}
subnet 192.168.228.32 netmask 255.255.255.224 {
pool {
range 192.168.228.34 192.168.228.62;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
default-lease-time 7200;
max-lease-time 7200;
option domain-name "dept.utexas.edu";
option subnet-mask 255.255.255.224;
option broadcast-address 255.255.255.255;
option routers 192.168.228.33;
}
the new scopes were first added to the primary, it was then reloaded.
After both servers were in a "normal" state, the corresponding change was
made on the secondary and it was reloaded.
Per Stephen Carr's suggestion, I have increased the MCLT to 300 and both
servers are still in the same state.
We have two servers in a failover relationship, both running 4.1-ESV-R7.
After a reload of dhcpd on the secondary, it has not come out of the
recover state after almost an hour. We've had this happen with 3.1.3
and recently upgraded to this version. The only thing we've been able
to do is stop both instances of dhcpd and remove "my state" and "partner
state" from dhcpd.leases.
Here's the timeline of what happened.
1. A change was made to the configuration of the primary and dhcpd
reloaded at 15:39:14.
2. The primary moved back to a "normal" state at 15:43:42
Apr 24 15:39:14 primary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to partner-down
Apr 24 15:39:15 primary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 24 15:40:18 primary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer update
completed.
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from
recover to recover-done
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: peer moves from
partner-down to normal
Apr 24 15:43:42 primary-dhcp dhcpd: failover peer dhcp: I move from
recover-done to normal
Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to shutdown
Apr 24 15:44:53 primary-dhcp dhcpd: failover peer dhcp: I move from
normal to partner-down
Apr 24 15:44:54 primary-dhcp dhcpd: peer dhcp: disconnected
Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from
shutdown to recover
Apr 24 15:45:59 primary-dhcp dhcpd: failover peer dhcp: peer moves from
recover to recover
3. The corresponding change was made on the secondary and it was
reloaded at 15:44:53
4. At 15:44:54 it came back up into recover, then moved from recover to
startup, then from startup to recover. That's where it's been ever since.
Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: I move from
normal to shutdown
Apr 24 15:44:53 secondary-dhcp dhcpd: failover peer dhcp: peer moves
from normal to partner-down
Apr 24 15:44:54 secondary-dhcp dhcpd: failover peer dhcp: I move from
shutdown to recover
Apr 24 15:45:56 secondary-dhcp dhcpd: failover peer dhcp: I move from
recover to startup
Apr 24 15:45:59 secondary-dhcp dhcpd: failover peer dhcp: I move from
startup to recover
option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 86400;
max-lease-time 86400;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
primary;
address 192.168.100.2;
port 520;
peer address 192.168.101.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
mclt 120;
split 255;
load balance max seconds 5;
}
subnet 192.168.100.0 netmask 255.255.255.224 {
}
include "/dhcpd/dhcpd.network.conf";
and the /dhcpd/dhcpd.network.conf file holds the scope definitions. Both
servers sync time through ntp and have the same exact time.
Any information would be appreciated.
______________________________**_________________
dhcp-users mailing list
https://lists.isc.org/mailman/**listinfo/dhcp-users<https://lists.isc.org/mailman/listinfo/dhcp-users>
-------------- next part --------------
An HTML attachment was scrubbed...
<https://lists.isc.org/pipermail/dhcp-users/attachments/20130425/c084fffc/attachment-0001.html>
------------------------------
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
End of dhcp-users Digest, Vol 54, Issue 21
******************************************
Steven Carr
2013-04-30 20:37:50 UTC
Permalink
Can't see anything in the config that is suspect to be honest.

I assume you have a 'failover peer "dhcp";' statement inside each pool
statement? (that's why I asked for full config)

Personally I would change mclt to 3600 and spilt to 128 (there are
only a handful of situations where I would see split set to 0 or 255
the main one being when you have branch networks with a local DHCP
server and need a centralised "backup" DHCP incase the branch fails).

You could also try changing the port and peer port numbers (maybe
something >1024?) just on the off chance that it is being
blocked/terminated by something else, and it would be worth getting
packet captures going on each system to see exactly what comms are
happening between the two during the startup.

The only other thought I have is that it could be something to do with
the patch you have wrote. I'm not sure what impact this has had on the
data being written out to the leases file or being synchronised (you
might see this in a packet capture) but it could be choking on
something in that data that wasn't originally meant to be in there.

If you do change the split value then I would also flip the order of
domain-name-servers on the secondary server to load balance across the
two DNS servers, rather than dumping all queries on the first DNS
server.

Steve
Oscar Ricardo Silva
2013-05-10 21:00:49 UTC
Permalink
I have changed the split value to 128 and raised the MCLT to 300. After
the change, both servers were reloaded and came up normally. Twenty
minutes later, someone on staff made a change and the primary returned
to a normal state but then the secondary stayed in recover mode as we've
seen before.

Here's the logs, the configuration files (including some of the pool
statements). The primary is taken down at 15:09:13 and returns to
normal at 15:14:13. The secondary is then taken down at 15:14:44 but
then the last message was received at 15:15:47 (the logs were copied at
15:33:00)



Logs from primary:

15:09:13 primary-dhcp dhcpd: failover peer dhcp: I move from shutdown to
recover
15:10:13 primary-dhcp dhcpd: failover peer dhcp: I move from recover to
startup
15:10:13 primary-dhcp dhcpd: failover peer dhcp: I move from startup to
recover
15:13:31 primary-dhcp dhcpd: failover peer dhcp: peer update completed.
15:13:31 primary-dhcp dhcpd: failover peer dhcp: I move from recover to
recover-wait
15:14:13 primary-dhcp dhcpd: failover peer dhcp: I move from
recover-wait to recover-done
15:14:13 primary-dhcp dhcpd: failover peer dhcp: peer moves from
partner-down to normal
15:14:13 primary-dhcp dhcpd: failover peer dhcp: I move from
recover-done to normal
15:14:44 primary-dhcp dhcpd: failover peer dhcp: peer moves from normal
to shutdown
15:14:44 primary-dhcp dhcpd: failover peer dhcp: I move from normal to
partner-down
15:14:45 primary-dhcp dhcpd: peer dhcp: disconnected
15:15:47 primary-dhcp dhcpd: failover peer dhcp: peer moves from
shutdown to recover
15:15:47 primary-dhcp dhcpd: failover peer dhcp: peer moves from recover
to recover




Logs from secondary:

15:09:12 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to shutdown
15:09:12 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to
partner-down
15:09:13 secondary-dhcp dhcpd: peer dhcp: disconnected
15:10:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
shutdown to recover
15:10:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover to recover
15:13:31 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover to recover-wait
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover-wait to recover-done
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: I move from
partner-down to normal
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover-done to normal
15:14:44 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to
shutdown
15:14:44 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
normal to partner-down
15:14:45 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown
to recover
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from recover
to startup
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from startup
to recover





Primary:

option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
primary;
address 192.168.100.2;
port 520;
peer address 192.168.101.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
mclt 300;
split 128;
load balance max seconds 5;
}
subnet 192.168.100.0 netmask 255.255.255.224 {
}


subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}

subnet 192.168.235.0 netmask 255.255.255.128 {
pool {
range 192.168.235.13 192.168.235.126;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.235.1;
}





Secondary:

option domain-name-servers 192.168.50.40, 192.168.50.41 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
secondary;
address 192.168.101.2;
port 520;
peer address 192.168.100.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
load balance max seconds 5;
}
subnet 192.168.101.0 netmask 255.255.255.224 {
}

subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}

subnet 192.168.235.0 netmask 255.255.255.128 {
pool {
range 192.168.235.13 192.168.235.126;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.235.1;
}
Post by Steven Carr
Can't see anything in the config that is suspect to be honest.
I assume you have a 'failover peer "dhcp";' statement inside each pool
statement? (that's why I asked for full config)
Personally I would change mclt to 3600 and spilt to 128 (there are
only a handful of situations where I would see split set to 0 or 255
the main one being when you have branch networks with a local DHCP
server and need a centralised "backup" DHCP incase the branch fails).
You could also try changing the port and peer port numbers (maybe
something >1024?) just on the off chance that it is being
blocked/terminated by something else, and it would be worth getting
packet captures going on each system to see exactly what comms are
happening between the two during the startup.
The only other thought I have is that it could be something to do with
the patch you have wrote. I'm not sure what impact this has had on the
data being written out to the leases file or being synchronised (you
might see this in a packet capture) but it could be choking on
something in that data that wasn't originally meant to be in there.
If you do change the split value then I would also flip the order of
domain-name-servers on the secondary server to load balance across the
two DNS servers, rather than dumping all queries on the first DNS
server.
Steve
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
Steven Carr
2013-05-11 11:51:35 UTC
Permalink
What happens in the logs at 15:20 when the servers should then have
passed the MCLT time and come out of recovery?

Why put the system into partner-down at 15:14, what was the reasoning
behind this given both servers were online?
I have changed the split value to 128 and raised the MCLT to 300. After the
change, both servers were reloaded and came up normally. Twenty minutes
later, someone on staff made a change and the primary returned to a normal
state but then the secondary stayed in recover mode as we've seen before.
Here's the logs, the configuration files (including some of the pool
statements). The primary is taken down at 15:09:13 and returns to normal at
15:14:13. The secondary is then taken down at 15:14:44 but then the last
message was received at 15:15:47 (the logs were copied at 15:33:00)
15:09:13 primary-dhcp dhcpd: failover peer dhcp: I move from shutdown to
recover
15:10:13 primary-dhcp dhcpd: failover peer dhcp: I move from recover to
startup
15:10:13 primary-dhcp dhcpd: failover peer dhcp: I move from startup to
recover
15:13:31 primary-dhcp dhcpd: failover peer dhcp: peer update completed.
15:13:31 primary-dhcp dhcpd: failover peer dhcp: I move from recover to
recover-wait
15:14:13 primary-dhcp dhcpd: failover peer dhcp: I move from recover-wait to
recover-done
15:14:13 primary-dhcp dhcpd: failover peer dhcp: peer moves from
partner-down to normal
15:14:13 primary-dhcp dhcpd: failover peer dhcp: I move from recover-done to
normal
15:14:44 primary-dhcp dhcpd: failover peer dhcp: peer moves from normal to
shutdown
15:14:44 primary-dhcp dhcpd: failover peer dhcp: I move from normal to
partner-down
15:14:45 primary-dhcp dhcpd: peer dhcp: disconnected
15:15:47 primary-dhcp dhcpd: failover peer dhcp: peer moves from shutdown to
recover
15:15:47 primary-dhcp dhcpd: failover peer dhcp: peer moves from recover to
recover
15:09:12 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to
shutdown
15:09:12 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to
partner-down
15:09:13 secondary-dhcp dhcpd: peer dhcp: disconnected
15:10:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from shutdown
to recover
15:10:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover
to recover
15:13:31 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover
to recover-wait
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover-wait to recover-done
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: I move from partner-down
to normal
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover-done to normal
15:14:44 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to
shutdown
15:14:44 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to
partner-down
15:14:45 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown to
recover
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to
startup
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to
recover
option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
primary;
address 192.168.100.2;
port 520;
peer address 192.168.101.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
mclt 300;
split 128;
load balance max seconds 5;
}
subnet 192.168.100.0 netmask 255.255.255.224 {
}
subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}
subnet 192.168.235.0 netmask 255.255.255.128 {
pool {
range 192.168.235.13 192.168.235.126;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.235.1;
}
option domain-name-servers 192.168.50.40, 192.168.50.41 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
secondary;
address 192.168.101.2;
port 520;
peer address 192.168.100.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
load balance max seconds 5;
}
subnet 192.168.101.0 netmask 255.255.255.224 {
}
subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}
subnet 192.168.235.0 netmask 255.255.255.128 {
pool {
range 192.168.235.13 192.168.235.126;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.235.1;
}
Post by Steven Carr
Can't see anything in the config that is suspect to be honest.
I assume you have a 'failover peer "dhcp";' statement inside each pool
statement? (that's why I asked for full config)
Personally I would change mclt to 3600 and spilt to 128 (there are
only a handful of situations where I would see split set to 0 or 255
the main one being when you have branch networks with a local DHCP
server and need a centralised "backup" DHCP incase the branch fails).
You could also try changing the port and peer port numbers (maybe
something >1024?) just on the off chance that it is being
blocked/terminated by something else, and it would be worth getting
packet captures going on each system to see exactly what comms are
happening between the two during the startup.
The only other thought I have is that it could be something to do with
the patch you have wrote. I'm not sure what impact this has had on the
data being written out to the leases file or being synchronised (you
might see this in a packet capture) but it could be choking on
something in that data that wasn't originally meant to be in there.
If you do change the split value then I would also flip the order of
domain-name-servers on the secondary server to load balance across the
two DNS servers, rather than dumping all queries on the first DNS
server.
Steve
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
Steven Carr
2013-05-11 12:22:27 UTC
Permalink
Also what exactly is the process that happens when someone makes a change?

Looking at those logs I'm missing a "COMMUNICATIONS-INTERRUPTED"
stage, which I would expect to see when one of the peers reloads, and
I shouldn't be seeing any "PARTNER-DOWN" messages as the neither peer
should be placed in the partner down mode, neither of the partners are
down (down being offline for longer than MCLT).
Post by Steven Carr
What happens in the logs at 15:20 when the servers should then have
passed the MCLT time and come out of recovery?
Why put the system into partner-down at 15:14, what was the reasoning
behind this given both servers were online?
I have changed the split value to 128 and raised the MCLT to 300. After the
change, both servers were reloaded and came up normally. Twenty minutes
later, someone on staff made a change and the primary returned to a normal
state but then the secondary stayed in recover mode as we've seen before.
Here's the logs, the configuration files (including some of the pool
statements). The primary is taken down at 15:09:13 and returns to normal at
15:14:13. The secondary is then taken down at 15:14:44 but then the last
message was received at 15:15:47 (the logs were copied at 15:33:00)
15:09:13 primary-dhcp dhcpd: failover peer dhcp: I move from shutdown to
recover
15:10:13 primary-dhcp dhcpd: failover peer dhcp: I move from recover to
startup
15:10:13 primary-dhcp dhcpd: failover peer dhcp: I move from startup to
recover
15:13:31 primary-dhcp dhcpd: failover peer dhcp: peer update completed.
15:13:31 primary-dhcp dhcpd: failover peer dhcp: I move from recover to
recover-wait
15:14:13 primary-dhcp dhcpd: failover peer dhcp: I move from recover-wait to
recover-done
15:14:13 primary-dhcp dhcpd: failover peer dhcp: peer moves from
partner-down to normal
15:14:13 primary-dhcp dhcpd: failover peer dhcp: I move from recover-done to
normal
15:14:44 primary-dhcp dhcpd: failover peer dhcp: peer moves from normal to
shutdown
15:14:44 primary-dhcp dhcpd: failover peer dhcp: I move from normal to
partner-down
15:14:45 primary-dhcp dhcpd: peer dhcp: disconnected
15:15:47 primary-dhcp dhcpd: failover peer dhcp: peer moves from shutdown to
recover
15:15:47 primary-dhcp dhcpd: failover peer dhcp: peer moves from recover to
recover
15:09:12 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to
shutdown
15:09:12 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to
partner-down
15:09:13 secondary-dhcp dhcpd: peer dhcp: disconnected
15:10:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from shutdown
to recover
15:10:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover
to recover
15:13:31 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover
to recover-wait
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover-wait to recover-done
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: I move from partner-down
to normal
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover-done to normal
15:14:44 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to
shutdown
15:14:44 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to
partner-down
15:14:45 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown to
recover
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to
startup
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to
recover
option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
primary;
address 192.168.100.2;
port 520;
peer address 192.168.101.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
mclt 300;
split 128;
load balance max seconds 5;
}
subnet 192.168.100.0 netmask 255.255.255.224 {
}
subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}
subnet 192.168.235.0 netmask 255.255.255.128 {
pool {
range 192.168.235.13 192.168.235.126;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.235.1;
}
option domain-name-servers 192.168.50.40, 192.168.50.41 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
secondary;
address 192.168.101.2;
port 520;
peer address 192.168.100.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
load balance max seconds 5;
}
subnet 192.168.101.0 netmask 255.255.255.224 {
}
subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}
subnet 192.168.235.0 netmask 255.255.255.128 {
pool {
range 192.168.235.13 192.168.235.126;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.235.1;
}
Post by Steven Carr
Can't see anything in the config that is suspect to be honest.
I assume you have a 'failover peer "dhcp";' statement inside each pool
statement? (that's why I asked for full config)
Personally I would change mclt to 3600 and spilt to 128 (there are
only a handful of situations where I would see split set to 0 or 255
the main one being when you have branch networks with a local DHCP
server and need a centralised "backup" DHCP incase the branch fails).
You could also try changing the port and peer port numbers (maybe
something >1024?) just on the off chance that it is being
blocked/terminated by something else, and it would be worth getting
packet captures going on each system to see exactly what comms are
happening between the two during the startup.
The only other thought I have is that it could be something to do with
the patch you have wrote. I'm not sure what impact this has had on the
data being written out to the leases file or being synchronised (you
might see this in a packet capture) but it could be choking on
something in that data that wasn't originally meant to be in there.
If you do change the split value then I would also flip the order of
domain-name-servers on the secondary server to load balance across the
two DNS servers, rather than dumping all queries on the first DNS
server.
Steve
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
Oscar Ricardo Silva
2013-05-13 17:43:14 UTC
Permalink
Right before 15:09 we made a configuration change on the primary, adding
a new network definition. In order for that new network to be active we
had to restart dhcpd on the primary. The portion of the config file
holding all the network definitions was copied to the secondary. Once
both servers were online and "normal", dhcpd was restarted on the
secondary so it would then be able to provide services on that new network.

Maybe we're doing something wrong here but it's my understanding that
after a configuration change, dhcpd has to be restarted.

Just in case it's a timing issue I've told coworkers to wait an
additional 10 minutes after the two servers go into "normal" before
restarting the primary.
Post by Steven Carr
What happens in the logs at 15:20 when the servers should then have
passed the MCLT time and come out of recovery?
Why put the system into partner-down at 15:14, what was the reasoning
behind this given both servers were online?
I have changed the split value to 128 and raised the MCLT to 300. After the
change, both servers were reloaded and came up normally. Twenty minutes
later, someone on staff made a change and the primary returned to a normal
state but then the secondary stayed in recover mode as we've seen before.
Here's the logs, the configuration files (including some of the pool
statements). The primary is taken down at 15:09:13 and returns to normal at
15:14:13. The secondary is then taken down at 15:14:44 but then the last
message was received at 15:15:47 (the logs were copied at 15:33:00)
15:09:13 primary-dhcp dhcpd: failover peer dhcp: I move from shutdown to
recover
15:10:13 primary-dhcp dhcpd: failover peer dhcp: I move from recover to
startup
15:10:13 primary-dhcp dhcpd: failover peer dhcp: I move from startup to
recover
15:13:31 primary-dhcp dhcpd: failover peer dhcp: peer update completed.
15:13:31 primary-dhcp dhcpd: failover peer dhcp: I move from recover to
recover-wait
15:14:13 primary-dhcp dhcpd: failover peer dhcp: I move from recover-wait to
recover-done
15:14:13 primary-dhcp dhcpd: failover peer dhcp: peer moves from
partner-down to normal
15:14:13 primary-dhcp dhcpd: failover peer dhcp: I move from recover-done to
normal
15:14:44 primary-dhcp dhcpd: failover peer dhcp: peer moves from normal to
shutdown
15:14:44 primary-dhcp dhcpd: failover peer dhcp: I move from normal to
partner-down
15:14:45 primary-dhcp dhcpd: peer dhcp: disconnected
15:15:47 primary-dhcp dhcpd: failover peer dhcp: peer moves from shutdown to
recover
15:15:47 primary-dhcp dhcpd: failover peer dhcp: peer moves from recover to
recover
15:09:12 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to
shutdown
15:09:12 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to
partner-down
15:09:13 secondary-dhcp dhcpd: peer dhcp: disconnected
15:10:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from shutdown
to recover
15:10:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover
to recover
15:13:31 secondary-dhcp dhcpd: failover peer dhcp: peer moves from recover
to recover-wait
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover-wait to recover-done
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: I move from partner-down
to normal
15:14:13 secondary-dhcp dhcpd: failover peer dhcp: peer moves from
recover-done to normal
15:14:44 secondary-dhcp dhcpd: failover peer dhcp: I move from normal to
shutdown
15:14:44 secondary-dhcp dhcpd: failover peer dhcp: peer moves from normal to
partner-down
15:14:45 secondary-dhcp dhcpd: failover peer dhcp: I move from shutdown to
recover
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to
startup
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to
recover
option domain-name-servers 192.168.50.41, 192.168.50.40 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
primary;
address 192.168.100.2;
port 520;
peer address 192.168.101.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
mclt 300;
split 128;
load balance max seconds 5;
}
subnet 192.168.100.0 netmask 255.255.255.224 {
}
subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}
subnet 192.168.235.0 netmask 255.255.255.128 {
pool {
range 192.168.235.13 192.168.235.126;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.235.1;
}
option domain-name-servers 192.168.50.40, 192.168.50.41 ;
option ntp-servers 192.168.50.40, 192.168.50.41;
default-lease-time 172800;
max-lease-time 172800;
one-lease-per-client true;
ddns-update-style ad-hoc;
ddns-updates off;
authoritative;
key-off-mac-address true;
if substring (option dhcp-client-identifier, 0, 5) = 01:52:41:53:20 {
deny booting;
}
option voip-tftp-server-address code 150 = array of ip-address ;
set vendor-string = option vendor-class-identifier;
failover peer "dhcp" {
secondary;
address 192.168.101.2;
port 520;
peer address 192.168.100.2;
peer port 520;
max-response-delay 60;
max-unacked-updates 10;
load balance max seconds 5;
}
subnet 192.168.101.0 netmask 255.255.255.224 {
}
subnet 192.168.75.128 netmask 255.255.255.128 {
pool {
range 192.168.75.130 192.168.75.254;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.75.129;
}
subnet 192.168.235.0 netmask 255.255.255.128 {
pool {
range 192.168.235.13 192.168.235.126;
deny dynamic bootp clients ;
failover peer "dhcp" ;
}
option subnet-mask 255.255.255.128;
option broadcast-address 255.255.255.255;
option routers 192.168.235.1;
}
Post by Steven Carr
Can't see anything in the config that is suspect to be honest.
I assume you have a 'failover peer "dhcp";' statement inside each pool
statement? (that's why I asked for full config)
Personally I would change mclt to 3600 and spilt to 128 (there are
only a handful of situations where I would see split set to 0 or 255
the main one being when you have branch networks with a local DHCP
server and need a centralised "backup" DHCP incase the branch fails).
You could also try changing the port and peer port numbers (maybe
something >1024?) just on the off chance that it is being
blocked/terminated by something else, and it would be worth getting
packet captures going on each system to see exactly what comms are
happening between the two during the startup.
The only other thought I have is that it could be something to do with
the patch you have wrote. I'm not sure what impact this has had on the
data being written out to the leases file or being synchronised (you
might see this in a packet capture) but it could be choking on
something in that data that wasn't originally meant to be in there.
If you do change the split value then I would also flip the order of
domain-name-servers on the secondary server to load balance across the
two DNS servers, rather than dumping all queries on the first DNS
server.
Steve
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
Oscar Ricardo Silva
2013-05-13 18:05:09 UTC
Permalink
Regarding what happens in the logs at 15:20, there's nothing on the
primary other than incoming requests from clients. On the secondary I saw:

15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from recover
to startup
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from startup
to recover
15:15:47 secondary-dhcp dhcpd: Sent update request all message to dhcp
15:18:33 secondary-dhcp dhcpd: dhcp: ignored (recovering)
15:20:02 secondary-dhcp dhcpd: dhcp: ignored (recovering)


meanwhile, requests were coming in from the clients. Here's a fuller
picture of that section of the log where you'll see the secondary
continues to note that it's recovering.


15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.109 from
00:50:56:92:00:5d (NUR-022) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.16864.218 from
00:80:77:78:09:11 (BRN_780911) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPINFORM from 192.168.186.103 via
192.168.166.250
15:20:01 secondary-dhcp dhcpd: DHCPACK to 192.168.186.103
(00:1e:c9:37:43:1f) via em2
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.218 from
00:50:56:92:00:45 (LAW-KSK-006) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.162 from
00:50:56:92:0d:b6 (FAC227-010) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.128 from
00:50:56:92:03:1e (FAC-027) via em2: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.194.238 via
192.168.194.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.194.238
(00:18:8b:6e:32:0f) via em2
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0b:db:59:f9:80 via
192.168.39.97: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: dhcp: ignored (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from d0:d0:fd:2a:01:57
(ATS-ACLP-UTDELL) via 192.168.226.1: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.226.19
(172.16.200.2) from d0:d0:fd:2a:01:57 (ATS-ACLP-UTDELL) via
192.168.226.1: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0d:56:ee:13:94 via
192.168.180.1: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.176.131 via
192.168.176.1
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.176.131
(00:24:e8:fc:43:d2) via em2
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.222.121 via
192.168.82.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.222.121
(00:23:ae:a1:69:38) via em2
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168155.193 via
192.168155.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168155.193
(00:26:6c:25:47:97) via em2
15:20:02 secondary-dhcp dhcpd: DHCPREQUEST for 192.16873.186 from
00:1f:5b:ea:97:36 via 192.16873.129: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: BOOTREQUEST from 2c:41:38:7c:c6:e0 via
192.168.164.250: BOOTP from dynamic client and no dynamic leases
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.242.159 via
192.168.242.250
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.242.159
(60:eb:69:32:93:2e) via em2
15:20:03 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.31 from
00:50:56:92:01:70 (FAC227-011) via em2: not responding (recovering)
15:20:03 secondary-dhcp dhcpd: DHCPINFORM from 192.168119.9 via
192.16896.129
15:20:03 secondary-dhcp dhcpd: DHCPACK to 192.168119.9
(d4:85:64:b5:46:f1) via em2
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.44.81 via
192.168.44.1
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.44.81
(00:23:ae:55:f4:86) via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.220 from
00:50:56:92:00:9c (LAW-046) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 172.16.226.236 from
00:26:bb:5f:96:12 (fac-809982) via 192.168.165.129: not responding
(recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 78:45:c4:0b:43:e3
(heco-pelab03) via 192.168.64.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 00:26:bb:5f:96:12 via
192.168.165.129: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.191.83 via
192.168.191.1
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.191.83
(d4:be:d9:c6:0c:f7) via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.213.218 from
90:b1:1c:64:6d:d3 (CM-M3310-867716) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 78:45:c4:1e:55:03
(HECO-PELAB08) via 192.168.64.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.169.111 from
00:50:56:92:0a:fb (CSE-LAB-008) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.198.201 from
00:1f:f3:45:36:7a (LA-ITS827888WD) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.186.43 via
192.168.166.250
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.186.43
(00:22:19:1c:c8:1b) via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.16873.186 from
00:1f:5b:ea:97:36 via 192.16873.129: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.55.189 from
3c:07:54:1c:c8:09 via 192.168.55.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.120 from
00:50:56:92:0d:a6 (NUR-045) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.67.180 from
40:6c:8f:3f:84:50 via 192.168.67.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.67.180 from
40:6c:8f:3f:84:50 via 192.168.67.250: not responding (recovering)
...
15:29:54 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0a:95:36:5c:93
(RUNIN-TOP) via 192.168.230.1: not responding (recovering)
15:29:59 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0a:95:35:f3:77
(Xserve) via 192.168.93.193: not responding (recovering)
Post by Oscar Ricardo Silva
Right before 15:09 we made a configuration change on the primary, adding
a new network definition. In order for that new network to be active we
had to restart dhcpd on the primary. The portion of the config file
holding all the network definitions was copied to the secondary. Once
both servers were online and "normal", dhcpd was restarted on the
secondary so it would then be able to provide services on that new network.
Maybe we're doing something wrong here but it's my understanding that
after a configuration change, dhcpd has to be restarted.
Just in case it's a timing issue I've told coworkers to wait an
additional 10 minutes after the two servers go into "normal" before
restarting the primary.
Post by Steven Carr
What happens in the logs at 15:20 when the servers should then have
passed the MCLT time and come out of recovery?
Why put the system into partner-down at 15:14, what was the reasoning
behind this given both servers were online?
...
Steven Carr
2013-05-13 18:20:03 UTC
Permalink
So the only thing I can think of at this stage is that the secondary
is failing to synchronise after it comes back online, which is why
it's staying in recovery mode.

Could there be anything blocking port 520? Do these boxes have SELinux
running? When this happens again can you run a packet capture on the
primary and secondary and see what is being sent/received to port 520?
You could also try changing this port? (the ports which seem to
recommended these days are 647 for the primary and 847 for the
secondary)

Instead of issuing a restart, maybe do a shutdown, wait ~15 seconds
and then startup dhcpd, just incase something is holding open the port
so when DHCP comes back up it can't open the port, I would expect this
to be logged somewhere but if only the team that developed BIND
logging channels spent some time implementing detailed logging
channels on DHCPD...

HTH
Regarding what happens in the logs at 15:20, there's nothing on the primary
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to
startup
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to
recover
15:15:47 secondary-dhcp dhcpd: Sent update request all message to dhcp
15:18:33 secondary-dhcp dhcpd: dhcp: ignored (recovering)
15:20:02 secondary-dhcp dhcpd: dhcp: ignored (recovering)
meanwhile, requests were coming in from the clients. Here's a fuller picture
of that section of the log where you'll see the secondary continues to note
that it's recovering.
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.109 from
00:50:56:92:00:5d (NUR-022) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.16864.218 from
00:80:77:78:09:11 (BRN_780911) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPINFORM from 192.168.186.103 via
192.168.166.250
15:20:01 secondary-dhcp dhcpd: DHCPACK to 192.168.186.103
(00:1e:c9:37:43:1f) via em2
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.218 from
00:50:56:92:00:45 (LAW-KSK-006) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.162 from
00:50:56:92:0d:b6 (FAC227-010) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.128 from
00:50:56:92:03:1e (FAC-027) via em2: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.194.238 via
192.168.194.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.194.238
(00:18:8b:6e:32:0f) via em2
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0b:db:59:f9:80 via
192.168.39.97: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: dhcp: ignored (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from d0:d0:fd:2a:01:57
(ATS-ACLP-UTDELL) via 192.168.226.1: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.226.19 (172.16.200.2)
from d0:d0:fd:2a:01:57 (ATS-ACLP-UTDELL) via 192.168.226.1: not responding
(recovering)
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0d:56:ee:13:94 via
192.168.180.1: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.176.131 via
192.168.176.1
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.176.131
(00:24:e8:fc:43:d2) via em2
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.222.121 via
192.168.82.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.222.121
(00:23:ae:a1:69:38) via em2
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168155.193 via
192.168155.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168155.193 (00:26:6c:25:47:97)
via em2
15:20:02 secondary-dhcp dhcpd: DHCPREQUEST for 192.16873.186 from
00:1f:5b:ea:97:36 via 192.16873.129: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: BOOTREQUEST from 2c:41:38:7c:c6:e0 via
192.168.164.250: BOOTP from dynamic client and no dynamic leases
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.242.159 via
192.168.242.250
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.242.159
(60:eb:69:32:93:2e) via em2
15:20:03 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.31 from
00:50:56:92:01:70 (FAC227-011) via em2: not responding (recovering)
15:20:03 secondary-dhcp dhcpd: DHCPINFORM from 192.168119.9 via
192.16896.129
15:20:03 secondary-dhcp dhcpd: DHCPACK to 192.168119.9 (d4:85:64:b5:46:f1)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.44.81 via
192.168.44.1
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.44.81 (00:23:ae:55:f4:86)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.220 from
00:50:56:92:00:9c (LAW-046) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 172.16.226.236 from
00:26:bb:5f:96:12 (fac-809982) via 192.168.165.129: not responding
(recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 78:45:c4:0b:43:e3
(heco-pelab03) via 192.168.64.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 00:26:bb:5f:96:12 via
192.168.165.129: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.191.83 via
192.168.191.1
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.191.83 (d4:be:d9:c6:0c:f7)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.213.218 from
90:b1:1c:64:6d:d3 (CM-M3310-867716) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 78:45:c4:1e:55:03
(HECO-PELAB08) via 192.168.64.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.169.111 from
00:50:56:92:0a:fb (CSE-LAB-008) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.198.201 from
00:1f:f3:45:36:7a (LA-ITS827888WD) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.186.43 via
192.168.166.250
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.186.43 (00:22:19:1c:c8:1b)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.16873.186 from
00:1f:5b:ea:97:36 via 192.16873.129: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.55.189 from
3c:07:54:1c:c8:09 via 192.168.55.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.120 from
00:50:56:92:0d:a6 (NUR-045) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.67.180 from
40:6c:8f:3f:84:50 via 192.168.67.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.67.180 from
40:6c:8f:3f:84:50 via 192.168.67.250: not responding (recovering)
...
15:29:54 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0a:95:36:5c:93
(RUNIN-TOP) via 192.168.230.1: not responding (recovering)
15:29:59 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0a:95:35:f3:77 (Xserve)
via 192.168.93.193: not responding (recovering)
Post by Oscar Ricardo Silva
Right before 15:09 we made a configuration change on the primary, adding
a new network definition. In order for that new network to be active we
had to restart dhcpd on the primary. The portion of the config file
holding all the network definitions was copied to the secondary. Once
both servers were online and "normal", dhcpd was restarted on the
secondary so it would then be able to provide services on that new network.
Maybe we're doing something wrong here but it's my understanding that
after a configuration change, dhcpd has to be restarted.
Just in case it's a timing issue I've told coworkers to wait an
additional 10 minutes after the two servers go into "normal" before
restarting the primary.
Post by Steven Carr
What happens in the logs at 15:20 when the servers should then have
passed the MCLT time and come out of recovery?
Why put the system into partner-down at 15:14, what was the reasoning
behind this given both servers were online?
...
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
Oscar Ricardo Silva
2013-05-15 21:12:37 UTC
Permalink
I really, REALLY wish there was a better way of debugging the state
between servers ... some way to debug the messages being exchanged
between the members of a failover pair ...

I checked and double-checked the ACLs between the servers as well as
iptables running on each server. The entire /24 on which each server
resides is explicitly allowed on both. I also double-checked the state
of SELinux and it is completely disabled.



What I've done now is changed the ports for communication between the
two. The primary is now listening on 647 and the secondary on 847:

Primary:
port 647;
peer port 847;

Secondary:
port 847;
peer port 647;




As for the "restart" process, we do actually shutdown the server, wait
for the PID to be removed, then restart it. We use this little script to
take it down gracefully:

#!/bin/sh
omshell << EOF
port 7911
key omkey XYZXYZXYZ
connect
new control
open
set state = 2
update
EOF


then wait for the PID to be removed:

#Start looping and testing for the absence of the PID.
while ps -p $DHCPPROC >/dev/null; do
sleep .5
done
sleep 1


Then start /usr/sbin/dhcpd. I'll add an extra 15 second delay before
the service is restarted.


I appreciate you taking an interest and the suggestions.
Post by Steven Carr
So the only thing I can think of at this stage is that the secondary
is failing to synchronise after it comes back online, which is why
it's staying in recovery mode.
Could there be anything blocking port 520? Do these boxes have SELinux
running? When this happens again can you run a packet capture on the
primary and secondary and see what is being sent/received to port 520?
You could also try changing this port? (the ports which seem to
recommended these days are 647 for the primary and 847 for the
secondary)
Instead of issuing a restart, maybe do a shutdown, wait ~15 seconds
and then startup dhcpd, just incase something is holding open the port
so when DHCP comes back up it can't open the port, I would expect this
to be logged somewhere but if only the team that developed BIND
logging channels spent some time implementing detailed logging
channels on DHCPD...
HTH
Regarding what happens in the logs at 15:20, there's nothing on the primary
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to
startup
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to
recover
15:15:47 secondary-dhcp dhcpd: Sent update request all message to dhcp
15:18:33 secondary-dhcp dhcpd: dhcp: ignored (recovering)
15:20:02 secondary-dhcp dhcpd: dhcp: ignored (recovering)
meanwhile, requests were coming in from the clients. Here's a fuller picture
of that section of the log where you'll see the secondary continues to note
that it's recovering.
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.109 from
00:50:56:92:00:5d (NUR-022) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.16864.218 from
00:80:77:78:09:11 (BRN_780911) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPINFORM from 192.168.186.103 via
192.168.166.250
15:20:01 secondary-dhcp dhcpd: DHCPACK to 192.168.186.103
(00:1e:c9:37:43:1f) via em2
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.218 from
00:50:56:92:00:45 (LAW-KSK-006) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.162 from
00:50:56:92:0d:b6 (FAC227-010) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.128 from
00:50:56:92:03:1e (FAC-027) via em2: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.194.238 via
192.168.194.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.194.238
(00:18:8b:6e:32:0f) via em2
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0b:db:59:f9:80 via
192.168.39.97: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: dhcp: ignored (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from d0:d0:fd:2a:01:57
(ATS-ACLP-UTDELL) via 192.168.226.1: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.226.19 (172.16.200.2)
from d0:d0:fd:2a:01:57 (ATS-ACLP-UTDELL) via 192.168.226.1: not responding
(recovering)
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0d:56:ee:13:94 via
192.168.180.1: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.176.131 via
192.168.176.1
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.176.131
(00:24:e8:fc:43:d2) via em2
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.222.121 via
192.168.82.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.222.121
(00:23:ae:a1:69:38) via em2
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168155.193 via
192.168155.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168155.193 (00:26:6c:25:47:97)
via em2
15:20:02 secondary-dhcp dhcpd: DHCPREQUEST for 192.16873.186 from
00:1f:5b:ea:97:36 via 192.16873.129: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: BOOTREQUEST from 2c:41:38:7c:c6:e0 via
192.168.164.250: BOOTP from dynamic client and no dynamic leases
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.242.159 via
192.168.242.250
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.242.159
(60:eb:69:32:93:2e) via em2
15:20:03 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.31 from
00:50:56:92:01:70 (FAC227-011) via em2: not responding (recovering)
15:20:03 secondary-dhcp dhcpd: DHCPINFORM from 192.168119.9 via
192.16896.129
15:20:03 secondary-dhcp dhcpd: DHCPACK to 192.168119.9 (d4:85:64:b5:46:f1)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.44.81 via
192.168.44.1
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.44.81 (00:23:ae:55:f4:86)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.220 from
00:50:56:92:00:9c (LAW-046) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 172.16.226.236 from
00:26:bb:5f:96:12 (fac-809982) via 192.168.165.129: not responding
(recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 78:45:c4:0b:43:e3
(heco-pelab03) via 192.168.64.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 00:26:bb:5f:96:12 via
192.168.165.129: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.191.83 via
192.168.191.1
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.191.83 (d4:be:d9:c6:0c:f7)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.213.218 from
90:b1:1c:64:6d:d3 (CM-M3310-867716) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 78:45:c4:1e:55:03
(HECO-PELAB08) via 192.168.64.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.169.111 from
00:50:56:92:0a:fb (CSE-LAB-008) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.198.201 from
00:1f:f3:45:36:7a (LA-ITS827888WD) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.186.43 via
192.168.166.250
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.186.43 (00:22:19:1c:c8:1b)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.16873.186 from
00:1f:5b:ea:97:36 via 192.16873.129: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.55.189 from
3c:07:54:1c:c8:09 via 192.168.55.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.120 from
00:50:56:92:0d:a6 (NUR-045) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.67.180 from
40:6c:8f:3f:84:50 via 192.168.67.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.67.180 from
40:6c:8f:3f:84:50 via 192.168.67.250: not responding (recovering)
...
15:29:54 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0a:95:36:5c:93
(RUNIN-TOP) via 192.168.230.1: not responding (recovering)
15:29:59 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0a:95:35:f3:77 (Xserve)
via 192.168.93.193: not responding (recovering)
Post by Oscar Ricardo Silva
Right before 15:09 we made a configuration change on the primary, adding
a new network definition. In order for that new network to be active we
had to restart dhcpd on the primary. The portion of the config file
holding all the network definitions was copied to the secondary. Once
both servers were online and "normal", dhcpd was restarted on the
secondary so it would then be able to provide services on that new network.
Maybe we're doing something wrong here but it's my understanding that
after a configuration change, dhcpd has to be restarted.
Just in case it's a timing issue I've told coworkers to wait an
additional 10 minutes after the two servers go into "normal" before
restarting the primary.
Post by Steven Carr
What happens in the logs at 15:20 when the servers should then have
passed the MCLT time and come out of recovery?
Why put the system into partner-down at 15:14, what was the reasoning
behind this given both servers were online?
...
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
Steven Carr
2013-05-15 21:26:14 UTC
Permalink
That script with omshell is your problem. Setting the state prepares
DHCPD that the partner is about to go down, so it's looking like the
peers are getting into a muddle as to what state the servers are
actually in. You don't need that. Just restart DHCPD with a normal
init script.

You need to let DHCP handle the failover itself, it will automatically
recover when the services are restarted. You should only ever tell the
DHCP server that the partner is going down if it's already down.
I really, REALLY wish there was a better way of debugging the state between
servers ... some way to debug the messages being exchanged between the
members of a failover pair ...
I checked and double-checked the ACLs between the servers as well as
iptables running on each server. The entire /24 on which each server
resides is explicitly allowed on both. I also double-checked the state of
SELinux and it is completely disabled.
What I've done now is changed the ports for communication between the two.
port 647;
peer port 847;
port 847;
peer port 647;
As for the "restart" process, we do actually shutdown the server, wait for
the PID to be removed, then restart it. We use this little script to take it
#!/bin/sh
omshell << EOF
port 7911
key omkey XYZXYZXYZ
connect
new control
open
set state = 2
update
EOF
#Start looping and testing for the absence of the PID.
while ps -p $DHCPPROC >/dev/null; do
sleep .5
done
sleep 1
Then start /usr/sbin/dhcpd. I'll add an extra 15 second delay before the
service is restarted.
I appreciate you taking an interest and the suggestions.
Post by Steven Carr
So the only thing I can think of at this stage is that the secondary
is failing to synchronise after it comes back online, which is why
it's staying in recovery mode.
Could there be anything blocking port 520? Do these boxes have SELinux
running? When this happens again can you run a packet capture on the
primary and secondary and see what is being sent/received to port 520?
You could also try changing this port? (the ports which seem to
recommended these days are 647 for the primary and 847 for the
secondary)
Instead of issuing a restart, maybe do a shutdown, wait ~15 seconds
and then startup dhcpd, just incase something is holding open the port
so when DHCP comes back up it can't open the port, I would expect this
to be logged somewhere but if only the team that developed BIND
logging channels spent some time implementing detailed logging
channels on DHCPD...
HTH
Regarding what happens in the logs at 15:20, there's nothing on the primary
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from recover to
startup
15:15:47 secondary-dhcp dhcpd: failover peer dhcp: I move from startup to
recover
15:15:47 secondary-dhcp dhcpd: Sent update request all message to dhcp
15:18:33 secondary-dhcp dhcpd: dhcp: ignored (recovering)
15:20:02 secondary-dhcp dhcpd: dhcp: ignored (recovering)
meanwhile, requests were coming in from the clients. Here's a fuller picture
of that section of the log where you'll see the secondary continues to note
that it's recovering.
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.109 from
00:50:56:92:00:5d (NUR-022) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.16864.218 from
00:80:77:78:09:11 (BRN_780911) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPINFORM from 192.168.186.103 via
192.168.166.250
15:20:01 secondary-dhcp dhcpd: DHCPACK to 192.168.186.103
(00:1e:c9:37:43:1f) via em2
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.218 from
00:50:56:92:00:45 (LAW-KSK-006) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.162 from
00:50:56:92:0d:b6 (FAC227-010) via em2: not responding (recovering)
15:20:01 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.128 from
00:50:56:92:03:1e (FAC-027) via em2: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.194.238 via
192.168.194.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.194.238
(00:18:8b:6e:32:0f) via em2
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0b:db:59:f9:80 via
192.168.39.97: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: dhcp: ignored (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from d0:d0:fd:2a:01:57
(ATS-ACLP-UTDELL) via 192.168.226.1: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.226.19 (172.16.200.2)
from d0:d0:fd:2a:01:57 (ATS-ACLP-UTDELL) via 192.168.226.1: not responding
(recovering)
15:20:02 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0d:56:ee:13:94 via
192.168.180.1: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.176.131 via
192.168.176.1
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.176.131
(00:24:e8:fc:43:d2) via em2
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.222.121 via
192.168.82.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.222.121
(00:23:ae:a1:69:38) via em2
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168155.193 via
192.168155.129
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168155.193
(00:26:6c:25:47:97)
via em2
15:20:02 secondary-dhcp dhcpd: DHCPREQUEST for 192.16873.186 from
00:1f:5b:ea:97:36 via 192.16873.129: not responding (recovering)
15:20:02 secondary-dhcp dhcpd: BOOTREQUEST from 2c:41:38:7c:c6:e0 via
192.168.164.250: BOOTP from dynamic client and no dynamic leases
15:20:02 secondary-dhcp dhcpd: DHCPINFORM from 192.168.242.159 via
192.168.242.250
15:20:02 secondary-dhcp dhcpd: DHCPACK to 192.168.242.159
(60:eb:69:32:93:2e) via em2
15:20:03 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.31 from
00:50:56:92:01:70 (FAC227-011) via em2: not responding (recovering)
15:20:03 secondary-dhcp dhcpd: DHCPINFORM from 192.168119.9 via
192.16896.129
15:20:03 secondary-dhcp dhcpd: DHCPACK to 192.168119.9
(d4:85:64:b5:46:f1)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.44.81 via
192.168.44.1
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.44.81
(00:23:ae:55:f4:86)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.220 from
00:50:56:92:00:9c (LAW-046) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 172.16.226.236 from
00:26:bb:5f:96:12 (fac-809982) via 192.168.165.129: not responding
(recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 78:45:c4:0b:43:e3
(heco-pelab03) via 192.168.64.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 00:26:bb:5f:96:12 via
192.168.165.129: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.191.83 via
192.168.191.1
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.191.83
(d4:be:d9:c6:0c:f7)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.213.218 from
90:b1:1c:64:6d:d3 (CM-M3310-867716) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPDISCOVER from 78:45:c4:1e:55:03
(HECO-PELAB08) via 192.168.64.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.169.111 from
00:50:56:92:0a:fb (CSE-LAB-008) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.198.201 from
00:1f:f3:45:36:7a (LA-ITS827888WD) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPINFORM from 192.168.186.43 via
192.168.166.250
15:20:04 secondary-dhcp dhcpd: DHCPACK to 192.168.186.43
(00:22:19:1c:c8:1b)
via em2
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.16873.186 from
00:1f:5b:ea:97:36 via 192.16873.129: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.55.189 from
3c:07:54:1c:c8:09 via 192.168.55.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.199.120 from
00:50:56:92:0d:a6 (NUR-045) via em2: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.67.180 from
40:6c:8f:3f:84:50 via 192.168.67.250: not responding (recovering)
15:20:04 secondary-dhcp dhcpd: DHCPREQUEST for 192.168.67.180 from
40:6c:8f:3f:84:50 via 192.168.67.250: not responding (recovering)
...
15:29:54 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0a:95:36:5c:93
(RUNIN-TOP) via 192.168.230.1: not responding (recovering)
15:29:59 secondary-dhcp dhcpd: DHCPDISCOVER from 00:0a:95:35:f3:77 (Xserve)
via 192.168.93.193: not responding (recovering)
Post by Oscar Ricardo Silva
Right before 15:09 we made a configuration change on the primary, adding
a new network definition. In order for that new network to be active we
had to restart dhcpd on the primary. The portion of the config file
holding all the network definitions was copied to the secondary. Once
both servers were online and "normal", dhcpd was restarted on the
secondary so it would then be able to provide services on that new network.
Maybe we're doing something wrong here but it's my understanding that
after a configuration change, dhcpd has to be restarted.
Just in case it's a timing issue I've told coworkers to wait an
additional 10 minutes after the two servers go into "normal" before
restarting the primary.
Post by Steven Carr
What happens in the logs at 15:20 when the servers should then have
passed the MCLT time and come out of recovery?
Why put the system into partner-down at 15:14, what was the reasoning
behind this given both servers were online?
...
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
_______________________________________________
dhcp-users mailing list
https://lists.isc.org/mailman/listinfo/dhcp-users
Paul B. Henson
2013-05-15 21:53:16 UTC
Permalink
Post by Oscar Ricardo Silva
Just in case it's a timing issue I've told coworkers to wait an
additional 10 minutes after the two servers go into "normal" before
restarting the primary.
Just as a reference point, we store our dhcp configuration in
subversion, and have a job that extracts it whenever it changes and
deploys it on the underlying servers. They both restart at about the
exact same time, and we've never had an issue.

Here's what it looks like:

May 15 14:40:05 mercury dhcpd: failover peer cpp: I move from normal to
startup
May 15 14:40:06 mercury dhcpd: failover peer cpp: peer moves from normal
to communications-interrupted
May 15 14:40:06 mercury dhcpd: failover peer cpp: I move from startup to
normal


May 15 14:40:06 gemini dhcpd: failover peer cpp: I move from
communications-interrupted to startup
May 15 14:40:06 gemini dhcpd: failover peer cpp: I move from startup to
communications-interrupted
May 15 14:40:06 gemini dhcpd: failover peer cpp: peer moves from normal
to normal
May 15 14:40:06 gemini dhcpd: failover peer cpp: I move from
communications-interrupted to normal
Steven Carr
2013-05-15 22:31:21 UTC
Permalink
Just as a reference point, we store our dhcp configuration in subversion,
and have a job that extracts it whenever it changes and deploys it on the
underlying servers. They both restart at about the exact same time, and
we've never had an issue.
May 15 14:40:05 mercury dhcpd: failover peer cpp: I move from normal to
startup
May 15 14:40:06 mercury dhcpd: failover peer cpp: peer moves from normal to
communications-interrupted
May 15 14:40:06 mercury dhcpd: failover peer cpp: I move from startup to
normal
May 15 14:40:06 gemini dhcpd: failover peer cpp: I move from
communications-interrupted to startup
May 15 14:40:06 gemini dhcpd: failover peer cpp: I move from startup to
communications-interrupted
May 15 14:40:06 gemini dhcpd: failover peer cpp: peer moves from normal to
normal
May 15 14:40:06 gemini dhcpd: failover peer cpp: I move from
communications-interrupted to normal
Yep, that's what I would expect to see, that communications are
interrupted between the peers, but the omshell script is flagging the
peers that the partner is down when it's not which is causing the
recovery problem, my guess is one of the peers still thinks they other
is down, or they both think each other is down.
Oscar Ricardo Silva
2013-05-15 22:46:11 UTC
Permalink
Post by Steven Carr
Just as a reference point, we store our dhcp configuration in subversion,
and have a job that extracts it whenever it changes and deploys it on the
underlying servers. They both restart at about the exact same time, and
we've never had an issue.
May 15 14:40:05 mercury dhcpd: failover peer cpp: I move from normal to
startup
May 15 14:40:06 mercury dhcpd: failover peer cpp: peer moves from normal to
communications-interrupted
May 15 14:40:06 mercury dhcpd: failover peer cpp: I move from startup to
normal
May 15 14:40:06 gemini dhcpd: failover peer cpp: I move from
communications-interrupted to startup
May 15 14:40:06 gemini dhcpd: failover peer cpp: I move from startup to
communications-interrupted
May 15 14:40:06 gemini dhcpd: failover peer cpp: peer moves from normal to
normal
May 15 14:40:06 gemini dhcpd: failover peer cpp: I move from
communications-interrupted to normal
Yep, that's what I would expect to see, that communications are
interrupted between the peers, but the omshell script is flagging the
peers that the partner is down when it's not which is causing the
recovery problem, my guess is one of the peers still thinks they other
is down, or they both think each other is down.
One of the things we see when this happens is that the primary gets
"stuck" as far as partner-state and local-state are concerned. When the
secondary is restarted, the primary says its local state is "partner
down" while the partner state is "recover". It stays this way even when
dhcpd is killed on the secondary.

I'll test out using a variation of the init script which kills the
process. I would still like for them to return to a normal state before
restarting the secondary.


Oscar
Steven Carr
2013-05-16 07:23:36 UTC
Permalink
One of the things we see when this happens is that the primary gets "stuck"
as far as partner-state and local-state are concerned. When the secondary
is restarted, the primary says its local state is "partner down" while the
partner state is "recover". It stays this way even when dhcpd is killed on
the secondary.
I'll test out using a variation of the init script which kills the process.
I would still like for them to return to a normal state before restarting
the secondary.
Yes it will do, as you've effectively told each partner in turn that
the other peer is down. Once the peer is down it wont make any attempt
to contact the other peer, it will wait for the peers to be brought
out of partner down mode, at which point a forced recovery may be
needed to get the peers back in sync, but during that forced recovery
neither system will issue leases.

You just have to trust the code I'm afraid, it does work.

Loading...