How Somebody Helped Kill dhcpd on Our Network

Martin McCormick

2006-07-31 13:03:20 UTC

We recently had our dhcp V3.0.3 server crash or, I should
say, get crashed by a denial-of-service scheme in which the miscreant
ran something on his/her computer that behaved enough like a dhcp
server to send DHCPNAKs to every dhcp request it saw on the
VLAN. This made every client send more traffic to the real dhcp
server and apparently caused the machine, a 1-gig processor with
1 gig of RAM, to consume all available memory. For 3 minutes, it
generated about 2,500 "out of memory" messages and then dhcpd finally
exited on signal 11 (SIGSEGV) with a core dump. Good old FreeBSD unix
is pretty bullet-proof, but this was more than the box could
handle. The platform continued to operate properly afterwards,
but dhcpd had to be restarted.

We are in the process of instituting dhcp failover with a
second server, although I suspect that this would only have
added maybe another few minutes to the mayhem before it, too,
succumbed to the hammering, since failover isn't going to protect
against external attacks that hit both servers equally.

We are also installing switches with port snooping
capabilities as funds permit to kill off anybody who aspires to
run dhcpd on anything other than the proper dhcp server. This
brings me to my question.

Is there any particular configuration parameter I can use
in dhcpd to make it rate-limit itself? It would have been a
much better outcome for it to come up for air, so to speak,
rather than crash and have to be restarted.

This particular event occurred on a Sunday afternoon on
the weekend that classes finished for the Summer semester and
campus activity was as near to dead as it ever gets around here
so we didn't even know anything was wrong for a period of time.

Our dhcp server gives service to around 10,000 clients of
several types and the platform it is on hardly breaks a sweat
even on the busiest days, but this was too much to handle.

Martin McCormick WB5AGZ Stillwater, OK
Systems Engineer
OSU Information Technology Department Network Operations Group

Darren

2006-07-31 14:11:51 UTC

Martin,

I suggest VLANs. Set up as many VLANs as you can and separate the user
traffic. The rogue DHCP server may not have been malicious in nature.
All someone has to do is hook a Linksys router up backwards and this
will happen. It's worse when all 10,000 users are in the same broadcast
domain. If it was one VLAN of 253 users or something, the damage would
have been minimal. The rogue DHCP server would not have seen the vast
majority of user traffic, and therefore would not have been able to
respond to it with DHCPNAK.

Tim Peiffer

2006-07-31 14:41:38 UTC

We used to have a problem with rogue servers when the big marketing push
for NAT routers and wireless complete with DHCP began hitting the
street. Since then, we have placed filters on the edge ports to deny
DHCP server traffic. Now we only have an issue if a new server pops
up. The clients can't get to their server until we remove the filter
from the server port.

This isn't rate limiting, but it is quite effective at handling the
original source of the problem.

In Cisco parlance:

interface G1/0/1
ip access-group No_DHCP_SERVER in
!
ip access-list extended Access_IN
deny udp any eq bootps any log
permit ip any any

Tim Peiffer

Tim Peiffer

2006-07-31 14:47:03 UTC

I apologize for the mis-config. Access_IN is our standard access list
naming, and I called the filter No_DHCP_SERVER to be more descriptive.

Tim

interface G1/0/1
ip access-group No_DHCP_SERVER in
!
ip access-list extended No_DHCP_SERVER
deny udp any eq bootps any log
permit ip any any

Martin McCormick

2006-07-31 14:53:50 UTC

Post by Darren
I suggest VLANs. Set up as many VLANs as you can and separate the user
traffic. The rogue DHCP server may not have been malicious in nature.

Thank you. I neglected to include that the VLAN in
question was a /21, a bit on the big side, but not our entire
network.
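
For a sense of scale, the size of a broadcast domain follows directly from the prefix length. A minimal sketch using Python's standard ipaddress module; the 10.0.0.0 prefixes are placeholders chosen only to compare a /21 with a /24, not addresses from this network.

```python
import ipaddress

def usable_hosts(cidr: str) -> int:
    """Return the number of usable host addresses in an IPv4 network.

    Assumes a prefix of /30 or shorter, where the network and
    broadcast addresses are both reserved.
    """
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2

# Placeholder prefixes, used only to compare broadcast-domain sizes.
for cidr in ("10.0.0.0/21", "10.0.0.0/24"):
    print(cidr, usable_hosts(cidr))
```

A rogue responder on the /21 can see broadcasts from roughly eight times as many clients as one confined to a /24.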

Martin McCormick

2006-07-31 15:05:52 UTC

Post by Tim Peiffer
I apologize for the mis-config. Access_IN is our standard access list
naming, and I called the filter No_DHCP_SERVER to be more descriptive.

Thanks for the good information. Cisco is where we are
going but unfortunately we haven't replaced everything yet.

Tim Peiffer

2006-07-31 15:16:04 UTC

Then the way to go is to run snoop, tcpdump, etc. on the VLAN that the
dhcpd is connected to, looking for responses from 67/udp, and run them
through a decode mechanism such as dhcpdump. Then you will need to
maintain a list of who the legit dhcpd servers are and make decisions on
whom/what to kill. Things to consider are DHCP relay agents, which
communicate from 67/udp to 67/udp, Novell servers, JumpStart, PXE boot,
etc., for validity.

tcpdump -s1500 -lenx -X src port 67 | dhcpdump
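
The "list of legit servers" check could be sketched roughly as follows. This is only an illustration in Python, assuming the capture has already been decoded into source address and port fields; the Packet type, the find_rogue_servers helper, and the 192.0.2.x addresses are all hypothetical. DHCP servers and relay agents send from UDP port 67 (bootps), so that is the port matched here.

```python
from typing import Iterable, NamedTuple, Set

BOOTPS = 67  # UDP port that DHCP servers and relay agents send from

class Packet(NamedTuple):
    """Hypothetical, already-decoded view of one captured UDP packet."""
    src_ip: str
    src_port: int
    dst_port: int

def find_rogue_servers(packets: Iterable[Packet], legit: Set[str]) -> Set[str]:
    """Return source IPs that sent from the server port but are not allowlisted."""
    return {
        p.src_ip
        for p in packets
        if p.src_port == BOOTPS and p.src_ip not in legit
    }

# Hypothetical data: one legit server, one rogue, one ordinary client.
legit_servers = {"192.0.2.10"}
seen = [
    Packet("192.0.2.10", 67, 68),
    Packet("192.0.2.77", 67, 68),
    Packet("192.0.2.50", 68, 67),
]
print(find_rogue_servers(seen, legit_servers))
```

Relay agents, PXE and JumpStart servers would need to be on the allowlist too, or they will show up as false positives.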

Tim Peiffer

Bob Franklin

2006-07-31 15:28:02 UTC

Post by Martin McCormick
Thanks for the good information. Cisco is where we are going but
unfortunately we haven't replaced everything yet.

For the reference, we do this on our Extreme edge switches:

create access-mask dhcp-mask ip-protocol dest-l4port source-l4port ports

create access-list dhcp-rogue access-mask dhcp-mask ip-protocol udp
dest-l4port 68 source-l4port 67 ports 1-48 deny

This works at layer 2 and blocks based on the ingress port 1-48; we don't
have to specify any IPs and only need to unblock if there is a need for
a DHCP server at the edge somewhere (which we don't generally allow).
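
The match logic of that rule can be modelled in a few lines. This is a sketch in Python of the decision only, not of how the switch implements it; the should_drop function and its parameter names are hypothetical, while the port range and UDP port numbers come from the rule above.

```python
BOOTPS, BOOTPC = 67, 68  # DHCP server and client UDP ports

def should_drop(ingress_port: int, protocol: str, src_port: int,
                dst_port: int, edge_ports: range = range(1, 49)) -> bool:
    """Return True for server-to-client DHCP traffic arriving on an edge port."""
    return (
        ingress_port in edge_ports
        and protocol == "udp"
        and src_port == BOOTPS
        and dst_port == BOOTPC
    )

print(should_drop(5, "udp", 67, 68))   # server reply arriving on edge port 5
print(should_drop(5, "udp", 68, 67))   # ordinary client request on edge port 5
print(should_drop(49, "udp", 67, 68))  # server reply on a port outside 1-48
```

Only the first case is dropped: client requests still pass, and ports outside the listed range are untouched.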

- Bob

--
Bob Franklin <***@reading.ac.uk> +44 (0)118 378 7147
Systems and Communications, IT Services, The University of Reading, UK

King, Michael

2006-07-31 15:40:01 UTC

Martin,

Many vendors have this functionality as well. Which model/manufacturer
do you have? Unfortunately, if it's of a certain vintage, it just might
not do it.

David W. Hankins

2006-07-31 16:12:12 UTC

Post by Martin McCormick
about 2,500 "out of memory" messages and then dhcpd finally
exited on signal 11 (SIGSEGV) with a core dump. Good old FreeBSD unix

Can you give me a 'where' on 'gdb /path/to/dhcpd /path/to/dhcpd.core'?

My guess on why dhcpd used more memory is you have ping-check
enabled, and for every client being sent back to it on a DISCOVER,
it had to enter into the scheduler another delayed ACK.

I think we could (should) put a limit on how many ping-checks are
outstanding.

Anyway, to verify that, what you want to do is again on your
gdb prompt have a look at the scheduler's task list:

print *timeouts

I'd expect to see the 'func' struct member be 'lease_ping_timeout'
more often than not. To get a feel for how many are queued, try
looking at ->next pointers:

print *timeouts->next

print *timeouts->next->next

print *timeouts->next->next->next

If they're (generally) all lease_ping_timeout, then I think
that's the most likely culprit.
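
To illustrate what a limit on outstanding ping-checks might look like, here is a toy model in Python. It is not ISC dhcpd code and does not reflect how dhcpd's scheduler is written; the PingCheckQueue class, its method names, and the cap of 3 are all hypothetical.

```python
from collections import deque

class PingCheckQueue:
    """Toy model of a scheduler that caps outstanding ping-checks."""

    def __init__(self, max_outstanding: int) -> None:
        self.max_outstanding = max_outstanding
        self.pending = deque()  # lease ids waiting on a ping timeout
        self.dropped = 0        # requests refused because the cap was hit

    def schedule(self, lease_id: str) -> bool:
        """Queue a ping-check; refuse and count it once the cap is reached."""
        if len(self.pending) >= self.max_outstanding:
            self.dropped += 1
            return False
        self.pending.append(lease_id)
        return True

    def complete(self) -> str:
        """Retire the oldest outstanding ping-check and return its lease id."""
        return self.pending.popleft()

queue = PingCheckQueue(max_outstanding=3)
results = [queue.schedule(f"lease-{i}") for i in range(5)]
print(results, len(queue.pending), queue.dropped)
```

With a cap in place, memory use for queued timeouts stays bounded under a flood; the cost is that some clients are refused and have to retry.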

--
David W. Hankins "If you don't do it right the first time,
Software Engineer you'll just have to do it again."
Internet Systems Consortium, Inc. -- Jack T. Hankins