These days it’s very common to have a single hardware device on the network acting as firewall / gateway / router / proxy / content filter, which is great for providing a high degree of security while keeping costs down. Some providers even offer the boon of managed automatic updates for these devices, which sounds great - I mean, why wouldn’t you want to be patched against a vulnerability as soon as there’s a fix, or to take advantage of new features as soon as they’re available?

Newer is not always better, though; problems can be introduced through the automatic update process, and when that happens in the background, without the administrator being aware, on a device that sits at the core of the network, bad things happen.

We have a Cyberhound appliance, and in this case I’m going to show how a bug in an update pushed out automatically by the provider caused havoc on our internal network.

Problem
Rolling the clock back to the last week of term, staff were busily working on student reports and getting all of their admin done before the holidays started. Demand on the network was high and it was really not the time for an impromptu major failure.

It started with shared resources dropping out. We have two top-level forests, one for students and the other for staff, and it looked like the trust between them had failed; none of the staff could access resources in the student domain, and the student domain servers couldn’t even ping servers in the staff domain by name.

All of this was bad news for a lot of teachers trying to do pretty much anything that involved accessing network resources. All of our backup agents outside the primary site were AWOL, and we were filling buckets with the event log errors being recorded.

Troubleshooting
In any environment where you have a whole bunch of Active Directory servers spread across multiple domains and sites, replication and DNS troubleshooting can get a bit tricky.

I started troubleshooting this as a trust issue, thinking that the trust accounts must have somehow been corrupted. Nope, not that. I quickly found that I was unable to establish the trusts and kept getting a ‘domain unavailable’ message.
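For anyone retracing these steps, the trust checks looked something like this, run from an elevated PowerShell prompt. The forest names here are placeholders, not our real ones:

    # Check the secure channel to the other domain
    nltest /sc_query:students.school.internal

    # Verify the trust between the two forests
    netdom trust staff.school.internal /domain:students.school.internal /verify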

“Aha!” I thought, “It’s a network issue. This will be easy, probably just a router reboot or something.”

I hopped on to the PDC emulator FSMO role holders of both top-level domains and ran a few tests.

IP connectivity was in place between all servers and they could all ping each other by IP address. Great, I thought, it’s not a routing issue. The DNS configuration on the network interfaces was correct on all of the Domain Controllers.
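Nothing fancy at this stage - just raw connectivity and interface checks, with a placeholder IP standing in for the remote DC:

    # Confirm basic IP reachability between the DCs
    ping 10.2.2.10

    # Check which DNS servers each DC's interfaces are pointing at
    ipconfig /all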

The result of a few ‘repadmin’ runs showed that replication to servers in the same site was successful but was failing across sites.
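The repadmin runs were along these lines (the DC name is a placeholder for one of the servers in another site):

    # Summarise replication health across the forest
    repadmin /replsummary

    # Show inbound replication partners and last-attempt results for a specific DC
    repadmin /showrepl DC01.staff.school.internal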
Increasingly worried, I did a couple of basic DNS checks, still working between the servers in the two different domains.

From one server, I ran nslookup and connected to the other. I asked it to resolve google.com and it did so, successfully. Just in case that was being returned from the local resolver cache I looked up www.howmanypeopleareinspacerightnow.com and got that too.
Using the same nslookup session, I queried for that very same server’s name and I got an NXDOMAIN reply. I tried the FQDN; same result.
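In non-interactive form, those queries looked roughly like this, with placeholder names and IPs standing in for the real servers:

    # External names resolve fine via the remote DC
    nslookup google.com 10.2.2.10
    nslookup www.howmanypeopleareinspacerightnow.com 10.2.2.10

    # The same server returns NXDOMAIN for its own name, short or fully qualified
    nslookup DC01 10.2.2.10
    nslookup DC01.students.school.internal 10.2.2.10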

“Hmm. So my DNS servers will resolve external Internet names but they won’t resolve internal ones, not even their own? That’s a new one on me.”

Just to check I wasn’t losing my marbles, I checked that member servers on the same site could correctly do local name lookups on the same servers; they could, no problem.

I cracked open the DNS consoles and looked for anything that would exclude source IPs or filter client responses. Given that no changes had been made around the time the problem started, a change to the DNS server configuration seemed unlikely, but I had to check anyway. Sure enough, nothing was there and the servers were still set to listen on all interfaces with no filtering.
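If you prefer the command line to the console, roughly the same information can be pulled this way (the server name is a placeholder):

    # Dump the DNS server's global settings, including the listen addresses
    dnscmd DC01.staff.school.internal /Info

    # Run the built-in DNS health tests on the DC while you're there
    dcdiag /test:dns /v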

While I was in there I checked the secondary / stub zones for the other domains. They were there alright, but when I tried to open them I got the error ‘zone never loaded’. I removed the zones and tried to re-add them. When I was specifying the IP address of the server to load the zone from, the console looked up its name successfully but then gave an error saying ‘server is not authoritative for this zone’.
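Recreating the zone can also be done from the command line; a sketch of the equivalent, with placeholder zone name, server name and master IP:

    # Re-add a stub zone for the other domain, pulling from one of its DCs
    dnscmd DC01.staff.school.internal /ZoneAdd students.school.internal /Stub 10.2.2.10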

Back on the other side, I checked the NS records for the zone and of course they had all the right servers listed and were coming back authoritative in nslookups from the same site.
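A quick way to double-check those NS records, again with placeholder names:

    # Query the NS records for the zone directly against one of its own DCs
    nslookup -type=NS students.school.internal 10.2.2.10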

“Right,” I thought, after hours of Googling to find related problems and having discounted each as irrelevant, “My DNS is clearly absolutely stuffed and I need to call Microsoft PSS”.

I was halfway through evaluating my support options when I recalled that Cyberhound had done an automatic update recently and thought to check when. Version 28.0.11 was automatically installed on Thu 27th June at just after midnight. The first event log error on the servers appeared less than an hour later. It was just too close to be a coincidence.
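Lining the timestamps up is easy enough from PowerShell; something like this, with the start time adjusted to match the update install time (the year here is assumed):

    # List the first few errors logged after the Cyberhound update went in
    Get-WinEvent -FilterHashtable @{ LogName='System','Directory Service'; Level=2; StartTime=(Get-Date '2013-06-27') } |
        Sort-Object TimeCreated |
        Select-Object -First 5 TimeCreated, ProviderName, Id, Message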

The Solution
After consulting with the support gurus at Cyberhound, we narrowed this down to a feature that had been improved in the latest software version, called ‘DNS Boundary Redirection’. This is designed to redirect DNS queries sent from clients on the LAN to servers on the Internet through the DNS cache on the Cyberhound itself. It improves performance, saves bandwidth and makes it impossible to do an unauthorised zone transfer from the internal domain to a system on the Internet; sounds useful, right?

[Screenshot: the DNS Boundary Redirection option]

In my case, however, it was selectively intercepting and altering DNS queries between local subnets, which is what caused all of my Active Directory replication problems. I immediately disabled the feature on the Cyberhound appliances at both sites; internal DNS queries started resolving successfully again, and I was eventually able to clear the replication errors. I reported this finding to Cyberhound and they are working on a fix.
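Once the feature was off, confirming the fix was just a matter of re-running the earlier checks and giving replication a push (names and IPs are placeholders again):

    # Internal names should now resolve via the remote DC
    nslookup DC01.students.school.internal 10.2.2.10

    # Force replication across all partitions and sites, then re-check the summary
    repadmin /syncall /AdeP
    repadmin /replsummary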

Needless to say, from this point forward I’m going to keep a much closer eye on automatic updates to my firewall!