With many Exchange projects, monitoring comes late in the project, or even worse, after the project completes. Having a robust monitoring solution and processes can greatly improve your ability to identify, troubleshoot, and repair issues before they affect end users. Designing a carefully thought out and comprehensive monitoring solution is worth the time, because it reduces service outages that can translate into hard dollar costs. A complete solution should do the following:

Identify performance issues: The faster the solution helps an administrator determine the cause, the shorter the time to problem resolution will be. Many organizations rely on their end users to alert them to issues. This causes end-user dissatisfaction and drives up help desk calls.
Identify growth trends: Over time, usage patterns change, business needs change, and hardware modifications may be needed to satisfy these changes. Trending helps forecast these changes so that administrators can adjust the environment proactively, before a capacity shortfall becomes an issue.
Track performance against established service level agreements: It is important to work with the business and management to report on how Exchange is meeting the metrics that have been defined. This can help quantify the value Exchange server administrators are providing, and may help with upgrades and other areas that need cost justifications.
Track configuration changes: Some Exchange environments can have hundreds of servers, all of which need ongoing maintenance. Even in smaller organizations, ensuring that settings remain consistent across servers can be challenging. Reviewing changes against known best practices also keeps service running smoothly.

Monitoring Exchange 2010 also requires a monitoring solution that can gather information from dependent services, such as Active Directory or DNS. Many enterprise environments have some form of a monitoring solution across their IT infrastructure. In cases where a comprehensive solution does not exist, or companies would like to integrate a best-of-breed solution for Exchange 2010, System Center Operations Manager 2007 R2 (SCOM) is an enterprise-class monitoring solution. Fundamentally, tools such as SCOM build upon the freely available tools that are part of Windows and Exchange. So although you can use the tools independently, using a comprehensive solution makes it easier to manage midsized or large environments.

Performance Monitor

When monitoring Exchange Server 2010, administrators should know which aspects of server performance are the most important. Performance counters and threshold values can identify potential issues, as well as identify the root cause of issues when troubleshooting. This section highlights some of the most important counters for each role. It is important to know that sometimes these thresholds need to be adjusted to work in a specific environment. One exercise is to create a baseline measurement of server performance during normal operations. After the baseline has been established, set thresholds to let administrators know when performance metrics are not met.
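The baseline exercise can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: it assumes the counter samples have already been exported from Performance Monitor into a list of numbers, and it assumes an alert threshold of the baseline mean plus `k` standard deviations. The function names and the choice of `k` are hypothetical and should be tuned for each environment.

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Return (baseline, alert_threshold) for samples taken during normal operations.

    The threshold is mean + k * sample standard deviation; k is an assumed
    tuning knob, not a value taken from Exchange guidance.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate variation")
    base = mean(samples)
    return base, base + k * stdev(samples)


def breaches(samples, threshold):
    """Return (index, value) pairs for every sample above the threshold."""
    return [(i, v) for i, v in enumerate(samples) if v > threshold]


if __name__ == "__main__":
    # Hypothetical latency samples (ms) gathered while the server was healthy.
    normal = [10, 12, 11, 13, 14]
    base, limit = baseline_threshold(normal, k=2.0)
    print(f"baseline={base:.1f} ms, alert above {limit:.1f} ms")
    print(breaches([11, 16, 15, 20], limit))
```

A statistical threshold like this complements, rather than replaces, the fixed expected values in the tables that follow.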

1. Performance Data for the Mailbox Server

Many performance counters are available for the Mailbox server; the storage-based counters listed in Table 1 are a good starting place when collecting performance data.

Table 1. Storage-Based Counters
COUNTER	DESCRIPTION	EXPECTED VALUE
MSExchange Database\I/O Database Reads (Attached) Average Latency	Shows the average time for reading data from the active database file.	The active copy on average should be below 20 ms. Spikes should not exceed 100 ms.
MSExchange Database\I/O Database Reads (Recovery) Average Latency	Shows the average time for reading data from the passive database file.	Passive copies on average should be below 200 ms. Spikes should not exceed 1,000 ms.
Database\Database Page Fault Stalls/sec	Shows the rate of page faults that cannot be serviced because no pages are available for allocation from the database cache.	Should be 0. Values above 0 indicate database write average latency is too high.

When collecting performance data for the Mailbox server, the focus is generally on storage response times. Slow response times will directly impact the user experience and passive copy replay. If the disk subsystem is not meeting demand, fixing the problem may require additional disks, faster disks, or modifying the disk configuration.
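As a rough illustration of how the Table 1 read-latency limits might be applied to collected samples, the following Python sketch checks both the average and the worst spike. The limits come from Table 1; the class, function, and constant names are hypothetical, and the samples are assumed to be latency values in milliseconds that were exported from Performance Monitor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyLimit:
    average_ms: float  # the average should stay below this value
    spike_ms: float    # individual spikes should not exceed this value


# Limits taken from Table 1.
ACTIVE_READ = LatencyLimit(average_ms=20, spike_ms=100)
PASSIVE_READ = LatencyLimit(average_ms=200, spike_ms=1000)


def evaluate_latency(samples_ms, limit):
    """Return a list of human-readable problems; an empty list means healthy."""
    if not samples_ms:
        raise ValueError("no samples supplied")
    average = sum(samples_ms) / len(samples_ms)
    peak = max(samples_ms)
    problems = []
    if average >= limit.average_ms:
        problems.append(f"average {average:.1f} ms is not below {limit.average_ms} ms")
    if peak > limit.spike_ms:
        problems.append(f"spike of {peak} ms exceeds {limit.spike_ms} ms")
    return problems


if __name__ == "__main__":
    # Hypothetical samples for an active database copy.
    for problem in evaluate_latency([15, 18, 150, 12], ACTIVE_READ):
        print(problem)
```

Keeping the average and spike limits separate mirrors the table: a single slow read is tolerated, but sustained latency is not.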

Table 2 shows the Information Store RPC processing counters. If RPC counters indicate a problem, several causes are possible.

Storage subsystem: Ensure that I/O read/write latencies are not excessive. Correlate the storage-based counters with RPC counters to see whether they align.
Network components: Check network card settings for errors, dropped packets, network speed, and duplex settings.
CPU: Identify whether the CPU is running near capacity. If the CPU is taxed, it will not be able to process RPC calls.
Applications: Identify applications that cause high amounts of RPC calls. Use Client Throttling Policies to prevent an application from consuming excessive server resources.

Table 2. RPC-Based Counters
COUNTER	DESCRIPTION	EXPECTED VALUE
MSExchangeIS\RPC Requests	Indicates the overall RPC requests that are currently executing within the store process.	Should be below 70 at all times.
MSExchangeIS\RPC Averaged Latency	Shows the RPC latency, in milliseconds, averaged for all operations in the last 1,024 packets.	Should not exceed 10 ms on average.
MSExchangeIS Client (*)\RPC Averaged Latency	Shows a server RPC latency in milliseconds averaged for the past 1,024 packets for a specific client protocol. The value of Other is used for MAPI clients.	Should not exceed 10 ms on average.
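One way to check whether storage latency and RPC latency "align" is to compute a correlation coefficient over samples taken at the same intervals. The Python sketch below uses a plain Pearson correlation; this is an assumed analysis technique chosen for illustration, not something Exchange or Performance Monitor provides, and the sample values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long series; 0.0 if either is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length and at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no trend to correlate with
    return covariance / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical samples captured at the same timestamps.
    disk_read_latency_ms = [8, 9, 25, 40, 12]
    rpc_averaged_latency_ms = [4, 5, 14, 22, 6]
    r = pearson(disk_read_latency_ms, rpc_averaged_latency_ms)
    print(f"correlation: {r:.2f}")
```

A coefficient close to 1 suggests the RPC latency rises and falls with the storage latency, which points toward the storage subsystem; a value near 0 suggests looking at the network, CPU, or client applications instead. Correlation alone does not prove cause.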

2. Performance Data for the Transport and Edge Server

The key counters for transport center on queues. Monitoring transport queues ensures timely message delivery.

Table 3. Transport- and Edge-Based Counters
COUNTER	DESCRIPTION	EXPECTED VALUE
MSExchangeTransport Queues(_total)\Aggregate Delivery Queue Length (All Queues)	The number of messages from all queues waiting for delivery.	Should be less than 3,000 and not more than 5,000.
MSExchangeTransport Queues(_total)\Active Remote Delivery Queue Length	The number of messages in the active remote delivery queues.	Should be less than 250 at all times.
MSExchangeTransport Queues(_total)\Submission Queue Length	The number of messages in the submission queue.	Should not exceed 100. Sustained high values indicate Active Directory or Mailbox server bottlenecks.
MSExchangeTransport Queues(_total)\Retry Remote Delivery Queue Length	The number of messages in a retry state in the remote delivery queues.	Should not exceed 100. Investigate the next hop to determine the causes for queuing.
MSExchangeTransport Queues(_total)\Poison Queue Length	The number of messages in the poison message queue.	Should be 0 at all times.
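The queue limits in Table 3 lend themselves to a simple lookup-and-compare check. The following Python sketch is illustrative only: the dictionary and function names are hypothetical, the readings are assumed to have been collected elsewhere, and the aggregate queue uses the lower 3,000 figure from the table as its limit, which is an interpretation of the table's wording.

```python
# Largest acceptable value per counter, expressed as an inclusive maximum.
# "Less than 250" becomes 249 because queue lengths are whole numbers.
MAX_QUEUE_LENGTH = {
    "Aggregate Delivery Queue Length (All Queues)": 2999,  # assumed: lower bound of Table 3
    "Active Remote Delivery Queue Length": 249,
    "Submission Queue Length": 100,
    "Retry Remote Delivery Queue Length": 100,
    "Poison Queue Length": 0,
}


def queue_alerts(readings):
    """Return {counter: (value, limit)} for every reading above its limit.

    Counters that are not listed in MAX_QUEUE_LENGTH are ignored.
    """
    alerts = {}
    for name, value in readings.items():
        limit = MAX_QUEUE_LENGTH.get(name)
        if limit is not None and value > limit:
            alerts[name] = (value, limit)
    return alerts


if __name__ == "__main__":
    # Hypothetical readings from one Hub Transport server.
    sample = {
        "Submission Queue Length": 100,
        "Retry Remote Delivery Queue Length": 340,
        "Poison Queue Length": 1,
    }
    for counter, (value, limit) in queue_alerts(sample).items():
        print(f"{counter}: {value} (limit {limit})")
```

Because transient spikes in queue length are normal, a production check would usually alert only when a limit is exceeded for several consecutive samples.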

3. Performance Data for the Client Access Server

The Client Access Server key counters center on client services such as Outlook Web App and Exchange Web Services.

Table 4. Client Access Server-Based Counters
COUNTER	DESCRIPTION	EXPECTED VALUE
ASP.NET\Application Restarts	The number of times the application has been restarted during the Web server’s lifetime.	Should be 0 at all times.
ASP.NET\Request Wait Time	The number of milliseconds the most recent request was waiting in the queue.	Should be 0 at all times.
ASP.NET Applications(*)\Requests in Application Queue	The number of requests in the application request queue.	Should be 0 at all times.
MSExchange OWA\Average Search Time	The average time that elapsed while waiting for search to complete.	Should be less than 5,000 ms at all times.
MSExchange Control Panel\Requests - Average Response Time	The average time in milliseconds ECP took to respond to a request during the sampling period.	Should be under 6,000 ms.

4. Performance Data for the Unified Messaging Server

Unified Messaging key counters monitor UM availability.

Table 5. Unified Messaging-Based Counters
COUNTER	DESCRIPTION	EXPECTED VALUE
MSExchangeUMAvailability\% of Failed Mailbox Connection Attempts Over the Last Hour	The percentage of mailbox connection attempts that failed in the last hour.	Should be less than 5 percent.
MSExchangeUMAvailability\% of Messages Successfully Processed Over the Last Hour	The percentage of messages that were successfully processed by UM in the last hour.	Should be greater than or equal to 95 percent.
MSExchangeUMAvailability\Directory Access Failures	The number of times that attempts to access Active Directory failed.	Should be 0 at all times.

Notes From The Field: Exchange Perfmon

Andy Schan

Senior Consultant, Schan Consulting, Inc., Canada

For several of my clients, I’ve been asked a question along the lines of “We’re having performance issues—which Exchange Performance Monitor counters should we look at?” My first response is to suggest that they look at it like any other Windows server and check the “big 4” first: CPU, memory, disk, and network. For example, if the network is saturated or the server is memory-starved, any Exchange-specific counters, such as RPC latency or message queues, are very likely to be only symptoms of the fundamental problem. You can’t overlook the Exchange metrics you should be monitoring, but don’t forget the fundamentals, either.

Another question I’ve been asked is, “The performance has degraded since we first deployed the servers; what has changed?” I then ask how the current performance compares to their baselines; that is the point when some people feel sheepish and say that gathering baselines has been on the to-do list for quite a while, but they’ve never had time to do it. Baselines are a crucial component of your monitoring and reporting strategy, and gathering and maintaining them should be made part of your deployment project plan, and should be on the critical path so that they are done before the project is closed off. It can be challenging and time-consuming to gather baselines at the appropriate time; it does no good to baseline your Exchange Server 2010 environment when only 100 of your 50,000 users have been migrated to Exchange Server 2010. On the other hand, if you leave it until the very end of the project, you may find yourself under pressure to simply close off the project and hand it over to the operational support group, who may not have the time or the skill sets required to gather comprehensive baseline metrics.

5. Windows Server and Active Directory

When troubleshooting Exchange, an administrator often needs to also troubleshoot related services. Active Directory and the underlying operating system are intimately tied together with Exchange. In many companies the Exchange administrators are separated from the network, DNS, and Active Directory infrastructure. Conversely, these other administrators are often not aware of how dependent Exchange is on these related infrastructure services. More mature organizations that are aware of the inter-relationships often create operational-level agreements (OLAs). OLAs are similar to SLAs—the key difference is that the agreements are between internal groups working to support an SLA, and the SLA is the agreement with a business group. To state it a different way, OLAs are created to ensure that core activities performed by different support teams are clearly aligned to meet the agreed-upon business SLAs. This is a fundamental piece of service-level management. As such, Exchange administrators should have enough knowledge to assist other support groups and know where the interdependencies are.

Windows Server 2003 introduced the Server Performance Advisor (SPA) to assist in identifying performance-related issues. Unfortunately, this tool is not compatible with Windows Server 2008/R2, but much of the SPA functionality was built directly into the Performance Monitor tool in Windows Server 2008. The addition of data collector sets allows for collections of monitor counters and system traces that are all related to a specific purpose. Windows Server 2008 comes with a number of predefined collections: Active Directory Diagnostics, System Diagnostics, System Performance, and LAN Diagnostics. Frequently when troubleshooting Active Directory, the first things to look at are search performance and LDAP performance. Using data collector sets, as shown in Figure 1, makes this easy to do.

Figure 1. Performance monitor data collector sets

After the data has been collected, reports can be generated. Some of the more interesting items are the LDAP client with the most CPU, and the searches with the most CPU. Figure 2 shows an example of a report.

You can quickly identify any clients consuming large amounts of the server’s resources. Additionally, the full search query terms are displayed. Custom applications frequently perform inefficient LDAP searches and might require the use of a network sniffer to troubleshoot. That typically requires involving groups outside of infrastructure, or deep knowledge of how to read and understand network traces. DNS information can be retrieved with the System Performance data collector set.

Keep in mind that the report only shows the symptoms of an issue, and the art of troubleshooting is working through a process to determine the root cause.

Figure 2. Performance Monitor report

You may notice that no pre-built data collector sets are specific to Exchange Server. The Exchange team posted role-based XML files that you can use to create a user-defined data collector set. The XML files are at http://blogs.technet.com/mikelag/archive/2010/01/11/perfwiz-for-exchange-2010.aspx . Currently, one XML file is available for the Mailbox role and another contains all counters.

Notes From The Field: Creating a Report of Performance Data

Alessandro Goncalves

Senior Support Engineer, Microsoft Corporation, USA

When you need to create a report from a performance log file, there is a very useful tool called PAL (available at http://pal.codeplex.com/Wikipage ) that enables you to quickly create a report full of information that can speed up the process of discovering a performance issue and help you with performance troubleshooting.

You can also use it to help determine your own thresholds for a specific environment. The numbers it provides are just a suggestion that works for the majority of scenarios, but they might not apply to your particular case. Analyze the report carefully and trace issues back to the Performance Monitor numbers.

Always remember to stop performance monitoring collection after you have finished gathering the data you require, because the data collection itself may affect the performance of normal operations.

These data collector sets make it easy to capture key performance counters, but it is important to know which performance aspects are the most important. There is a tension between running servers at maximum utilization and not crossing the threshold at which server performance suffers. Underutilized servers cost money (power, hardware, and options), and can possibly be consolidated. Knowing the key indicators can help an administrator determine when more capacity is needed or when excess capacity can be reduced.

5.1. Processor

The processor is one of the core components that need monitoring to ensure server health. Standard counters include the total percentage of processor time, the percentage of user-mode processor time, and the percentage of privileged-mode processor time. Another key counter is the processor queue length. If this length is outside of the operating threshold, the processor may have more work than it can handle. This may be a good indicator that a server could use a faster processor or additional cores.

Notes From The Field: Exchange and Hyper-V CPU Utilization Troubleshooting

Alessandro Goncalves

Senior Support Engineer, Microsoft Corporation, USA

When dealing with a virtualized Exchange Server 2010, it is very important to select the correct counters to accurately measure CPU performance as well as any other object.

When we have Exchange Server 2010 as a guest in a Hyper-V environment, the counters in the guest virtual machine do not show the actual physical processor utilization. Instead, we need to use Hyper-V Hypervisor performance counters in the root partition to determine the real CPU load. We also need to keep in mind that the host hardware is sharing its resources among other loads and virtual machines, and it is important to identify whether another virtual machine is creating the bottleneck in the host and thus affecting Exchange.

Therefore, a more precise picture of performance is obtained by collecting data in the root partition, that is, on the host itself. For example, a CPU spinner program inside the guest can drive the guest operating system’s CPU to 100 percent while having little effect on the host and the other virtual machines it runs. It is therefore important to recognize which virtual machine you are measuring; the instance name of the %Guest Run Time counter identifies the virtual machine you need to measure. It is easy to be misled by the CPU usage of an Exchange server running in a virtual environment, because Task Manager in the virtual machine can show 100 percent CPU utilization while the overall CPU utilization in the root partition is nowhere near that number. It is critical to realize that in a virtual environment, the accurate measure more often than not comes from the root partition. Please use the additional resources at the end of the chapter for more information.

5.2. Memory

Exchange 2010 benefits by using the 64-bit processor’s ability to address large amounts of memory efficiently. Monitoring the page file can reveal performance impact if the server has to constantly swap pages to slower disk storage. Increasing memory or reducing the server’s load are ways to improve performance.

5.3. Active Directory Access

Exchange heavily relies on Active Directory for everything from configuration information to user property information. Therefore, it is critical to monitor response times for searches and reads. Ensure that you have a good ratio of Exchange server cores to Active Directory cores. If the Global Catalog servers are running on a 32-bit platform, the recommended ratio of Active Directory server cores to Exchange Mailbox Server cores is 1:4. If the Global Catalog servers are running on a 64-bit platform, the recommended ratio of Active Directory server cores to Exchange Mailbox Server cores is 1:8 (assuming the Active Directory server has enough memory to cache the entire Active Directory database in memory). Upgrading Active Directory servers to 64 bit may be a way to get more performance without scaling out with new hardware.
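The core ratios above reduce to simple arithmetic. The following Python sketch, with hypothetical function and constant names, estimates how many Global Catalog cores a given number of Mailbox server cores calls for. It rounds up, on the assumption that a fractional core means one more whole core is needed, and it does not model the memory condition attached to the 64-bit ratio.

```python
from math import ceil

# Mailbox server cores that one Global Catalog core can support,
# per the 1:4 (32-bit) and 1:8 (64-bit) ratios in the text.
MAILBOX_CORES_PER_GC_CORE = {"32-bit": 4, "64-bit": 8}


def required_gc_cores(mailbox_cores, gc_platform):
    """Return the number of Global Catalog cores suggested by the ratio."""
    if mailbox_cores < 0:
        raise ValueError("mailbox_cores must not be negative")
    ratio = MAILBOX_CORES_PER_GC_CORE[gc_platform]
    return ceil(mailbox_cores / ratio)


if __name__ == "__main__":
    # Hypothetical site with four 8-core Mailbox servers.
    mailbox_cores = 4 * 8
    for platform in ("32-bit", "64-bit"):
        print(platform, required_gc_cores(mailbox_cores, platform))
```

For 32 Mailbox server cores, the ratios call for 8 Global Catalog cores on a 32-bit platform but only 4 on a 64-bit platform, which illustrates why moving Active Directory to 64 bit can avoid adding hardware.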

5.4. Disk Storage

Finally, Exchange is a very demanding database application that requires consistent and fast disk access. When I/O counters fall outside of normal range, client performance is directly impacted. For example, Outlook and Outlook Web App (OWA) users will report “Outlook is slow” when opening mail or moving between folders. Faster disks, more disks, or redesigning existing disk arrays are all ways to mitigate storage issues.

Notes From The Field: Consider Active Directory Replication Delays in Exchange 2010 Troubleshooting

Markus Bellmann

Senior Solution Architect, Siemens AG, Germany

With one customer, we ran into the issue that newly created databases did not mount immediately. The error was “Active Directory operation failed on <servername>. This error is not retriable. Additional information: The name reference is invalid”. The reason for that behavior is that Remote PowerShell connects to a Client Access server, not to a Mailbox server, so the new database object is written to the domain controller that the Client Access server uses and is not visible to the Mailbox server until Active Directory replication delivers it to the Mailbox server’s configuration domain controller. The good news is that Exchange will try to mount the database repeatedly, so when the information is replicated to the configuration domain controller responsible for the Mailbox server, the database will be automatically made available.

The bottom line is that when creating mailbox databases using Remote PowerShell, you may want to make sure that the Mailbox server and the corresponding Client Access servers use the same domain controller for configuration. You can verify this by running Get-ADServerSettings | fl, and you can configure the domain controller that PowerShell uses with the Set-ADServerSettings cmdlet.
