Problem : NTDS KCC Errors with Event ID 1566, 1311 & 1865

Setup: Our single domain has 3 sites, each with its own subnet. Site 1 (HQ) has 2 domain controllers running Server 2003 Standard SP1; remote Site 2 and remote Site 3 each have 1 domain controller running Server 2003 Standard SP1. The domain is in 2003 native mode. At the HQ site the FSMO roles are split across the 2 DCs: Server 1 holds the Infrastructure Master role, and Server 2 holds the PDC Emulator and RID Master roles and is a GC. The DC at each remote site is a GC. DNS is dynamic and AD-integrated and replicates to all DCs in the AD domain.
Sites and subnets are created in Active Directory Sites and Services. Replication links are created as follows: HQ site Server 1 from HQ site Server 3 and the remote Site 2 DC; HQ Server 2 from HQ Server 1 and the remote Site 2 DC; remote Site 2 from HQ Server 1 and HQ Server 2; remote Site 3 from HQ Server 1 and HQ Server 2. Time sync is internal Windows Time from HQ Server 2.
Each site has a firewall that is open to replication traffic, as far as can be told from following other technical documents on the ports needed to allow replication.
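
The firewall statement above can be checked rather than assumed. The following is a minimal Python sketch, not a definitive test: it attempts a TCP connection to the well-known ports that AD replication typically relies on. The port list is an assumption for illustration and is not exhaustive; in particular it does not cover UDP traffic or the dynamic RPC ports that replication uses after contacting the endpoint mapper on port 135.

```python
import socket

# Well-known TCP ports commonly involved in AD replication.
# Assumed, non-exhaustive list: dynamic RPC ports and UDP are not covered.
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    3268: "Global Catalog",
}


def check_tcp_ports(host, ports, timeout=3.0):
    """Return {port: bool}, True where a TCP connection to host:port succeeded."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the name and completes the TCP handshake
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # covers refused connections, timeouts and name resolution failures
            results[port] = False
    return results
```

Run from HQ Server 1 against a remote DC (for example `check_tcp_ports("remote-dc-name", AD_TCP_PORTS)`, where the host name is a placeholder), both while replication is healthy and while the errors are occurring; a port that flips from True to False between the two runs would be worth a closer look.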

Issue: Replication works with no issues for 7-10 days, then HQ Server 2 begins to record KCC errors in the Directory Service log with event IDs 1566, 1311 & 1865, along with DNS event error 6002.

The text of error 1566 is:
All domain controllers in the following site that can replicate the directory partition over this transport are currently unavailable.
Site:
CN=Site2,CN=Sites,CN=Configuration,DC=mycompany,DC=co,DC=uk
Directory partition:
CN=Configuration,DC=mycompany,DC=co,DC=uk
Transport:
CN=IP,CN=Inter-Site Transports,CN=Sites,CN=Configuration,DC=mycompany,DC=co,DC=uk

The text of error 1311 is:
The Knowledge Consistency Checker (KCC) has detected problems with the following directory partition.

Directory partition:
CN=Configuration,DC=mycompany,DC=co,DC=uk
There is insufficient site connectivity information in Active Directory Sites and Services for the KCC to create a spanning tree replication topology. Or, one or more domain controllers with this directory partition are unable to replicate the directory partition information. This is probably due to inaccessible domain controllers.

The text of error 1865 is:
The Knowledge Consistency Checker (KCC) was unable to form a complete spanning tree network topology. As a result, the following list of sites cannot be reached from the local site.
Sites:
CN=Site2,CN=Sites,CN=Configuration,DC=mycompany,DC=co,DC=uk

The text of DNS event ID 6002 is:
The transfer of version 11001 of zone mycompany.co.uk by the DNS server was aborted by the server at 192.168.0.2. To restart the transfer of the zone, you must initiate transfer at the secondary server.

While these errors occur, HQ Server 1 cannot connect to the DCs at remote Sites 2 & 3 by \\servername; the error returned is: Windows cannot find the server “servername”. Ping, however, replies with a normal response time, and nslookup of the remote server name returns a result without problems.
Dcdiag fails the kccevent test and also reports a number of failed replications.
The only way this can be cured is to reboot the server. HQ Server 2 will also report this error a few days later.
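
Since ping and nslookup succeed while \\servername fails, it may help to separate name resolution from SMB reachability. The sketch below is only a rough aid under stated assumptions: it resolves the name, then tries the two TCP ports that \\servername access normally uses (445 for SMB, 139 for the NetBIOS session service). The function name and return strings are made up for illustration, and the result only narrows down the failing layer; it does not explain the cause.

```python
import socket


def classify_reachability(host, smb_ports=(445, 139), timeout=3.0):
    """Report whether name resolution or the SMB TCP connection is the failing step."""
    try:
        ip = socket.gethostbyname(host)  # step 1: resolve the name to an IPv4 address
    except OSError:
        return "name resolution failed"
    for port in smb_ports:
        try:
            # step 2: try a plain TCP connection to each SMB-related port in turn
            with socket.create_connection((ip, port), timeout=timeout):
                return f"SMB reachable on TCP {port} at {ip}"
        except OSError:
            continue  # refused or timed out, try the next port
    return f"resolved to {ip} but no SMB port answered"
```

If this reports an SMB port as reachable while \\servername still fails, the problem is more likely above the TCP layer on the HQ server itself than in the firewall.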

Question: What is the issue here? Why would KCC errors happen consistently after the same time period on each HQ server?


Solution: NTDS KCC Errors with Event ID 1566, 1311 & 1865

I think this site should be of great help to you:
http://eventid.net/display.asp?eventid=1311&eventno=524&source=NTDS%20KCC&phase=1
EventID.net is very good for finding specific examples of Event ID problems. The link I gave here deals specifically with your problem and can point you in the right direction.