With many Exchange projects, monitoring comes late in the project or, even worse, after the project completes. A robust monitoring solution and supporting processes can greatly improve your ability to identify, troubleshoot, and repair issues before they impact end users. Designing a carefully thought-out and comprehensive monitoring solution is worth the time because it reduces service outages that can translate into hard dollar costs. A complete solution should do the following:
-
Identify performance issues: The faster the solution helps an administrator determine the cause, the shorter the time to problem resolution will be. Many organizations rely on their end users to alert them to issues. This causes end-user dissatisfaction and drives up help desk calls.
-
Identify growth trends: Over time, usage patterns change, business needs change, and hardware modifications may be needed to satisfy these changes. Trending helps forecast these changes so that the environment can be adjusted proactively, before a problem develops.
-
Track performance against established service level agreements: It is important to work with the business and management to report on how Exchange is meeting the metrics that have been defined. This can help quantify the value Exchange server administrators are providing, and may help with upgrades and other areas that need cost justification.
-
Track configuration changes: Some Exchange environments can have hundreds of servers, all of which need ongoing maintenance. Even in smaller organizations, ensuring that settings remain consistent across servers can be challenging. Reviewing changes against known best practices also keeps service running smoothly.
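As an illustration of the trending idea in the list above, the following minimal Python sketch fits a straight line to evenly spaced historical samples and estimates how many periods remain before a capacity limit is reached. The function names, the sample database sizes, and the 600 GB limit are all hypothetical and chosen only for illustration; real growth is rarely perfectly linear, so treat the result as a rough planning aid rather than a prediction.

```python
from statistics import mean


def linear_fit(ys):
    """Least-squares slope and intercept for evenly spaced samples (x = 0, 1, 2, ...)."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def periods_until(ys, limit):
    """Periods from the latest sample until the fitted line reaches limit.

    Returns None when the fitted trend is flat or shrinking.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(ys) - 1))


# Hypothetical monthly mailbox database sizes in GB (illustrative numbers only).
monthly_db_size_gb = [400, 420, 440, 460, 480]
months_left = periods_until(monthly_db_size_gb, limit=600)
print(months_left)
```

The same approach could be applied to any metric that is sampled at a regular interval, such as message volume or mailbox count.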
Monitoring Exchange 2010 also requires a monitoring solution that can gather information from dependent services, such as Active Directory or DNS. Many enterprise environments have some form of monitoring solution across their IT infrastructure. In cases where a comprehensive solution does not exist, or companies would like to integrate a best-of-breed solution for Exchange 2010, System Center Operations Manager 2007 R2 (SCOM) is an enterprise-class monitoring solution. Fundamentally, tools such as SCOM build upon the freely available tools that are part of Windows and Exchange. So although you can use the tools independently, using a comprehensive solution makes it easier to manage midsized or large environments.
Performance Monitor
When monitoring Exchange Server 2010, administrators should know which aspects of server performance are the most important. Performance counters and threshold values can identify potential issues, as well as identify the root cause of issues when troubleshooting. This section highlights some of the most important counters for each role. It is important to know that sometimes these thresholds need to be adjusted to work in a specific environment. One exercise is to create a baseline measurement of server performance during normal operations. After the baseline has been established, set thresholds to let administrators know when performance metrics are not met.
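The baseline exercise described above can be sketched in a few lines of Python. This is one possible approach, not a prescribed method: it assumes the threshold is set at the baseline mean plus a multiple of the standard deviation, and the multiplier `k`, the function names, and the latency samples are all hypothetical.

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Alert threshold derived from baseline samples: mean + k * standard deviation."""
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    return mean(samples) + k * stdev(samples)


def breaches(samples, threshold):
    """Return (index, value) pairs for samples that exceed the threshold."""
    return [(i, v) for i, v in enumerate(samples) if v > threshold]


# Hypothetical latency samples in milliseconds, collected during normal operations.
baseline_ms = [10, 12, 11, 13, 9, 10, 12, 11]
threshold_ms = baseline_threshold(baseline_ms)

# Hypothetical samples collected later, checked against the baseline threshold.
current_ms = [11, 14, 25, 12]
print(breaches(current_ms, threshold_ms))
```

A smaller `k` produces earlier but noisier alerts; the right value depends on how variable the counter is in a given environment.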
1. Performance Data for the Mailbox Server
Many performance counters are available for the Mailbox server; the storage-based counters listed in Table 1 are a good starting place when collecting performance data.
When collecting performance data for the Mailbox server, the focus is generally on storage response times. Slow response times will directly impact the user experience and passive copy replay. If the disk subsystem is not meeting demand, fixing the problem may require additional disks, faster disks, or modifying the disk configuration.
Table 2 shows the Information Store RPC processing counters. If RPC counters indicate a problem, several causes are possible.
-
Storage subsystem: Ensure that I/O read/write latencies are not excessive. Correlate the storage-based counters with RPC counters to see whether they align.
-
Network components: Check network card settings for errors, dropped packets, network speed, and duplex settings.
-
CPU/Processor: Identify whether the CPU is running near capacity. If the CPU is taxed, it will not be able to process RPC calls.
-
Applications: Identify applications that cause a high volume of RPC calls. Use Client Throttling Policies to prevent an application from consuming excessive server resources.
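One way to check whether storage-based counters and RPC counters align, as suggested in the list above, is to compute a correlation coefficient over samples taken at the same times. The following Python sketch computes the Pearson correlation by hand; the sample values are hypothetical and the series stand in for whichever storage latency and RPC latency counters are being collected. A value near 1 suggests the two move together, although correlation alone does not prove that storage is the root cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same timestamps (milliseconds).
disk_latency_ms = [8, 9, 22, 25, 10, 9]
rpc_latency_ms = [12, 14, 40, 47, 15, 13]

r = pearson(disk_latency_ms, rpc_latency_ms)
print(round(r, 3))
```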
2. Performance Data for the Transport and Edge Server
The key counters for transport center on queues. Monitoring transport queues ensures timely message delivery.
3. Performance Data for the Client Access Server
The Client Access Server key counters center on client services such as Outlook Web App and Exchange Web Services.
4. Performance Data for the Unified Messaging Server
Unified Messaging key counters monitor UM availability.
5. Windows Server and Active Directory
When troubleshooting Exchange, an administrator often needs to also troubleshoot related services. Active Directory and the underlying operating system are intimately tied together with Exchange. In many companies the Exchange administrators are separated from the network, DNS, and Active Directory infrastructure. Conversely, these other administrators are often not aware of how dependent Exchange is on these related infrastructure services. More mature organizations that are aware of the inter-relationships often create operational-level agreements (OLAs). OLAs are similar to SLAs—the key difference is that the agreements are between internal groups working to support an SLA, and the SLA is the agreement with a business group. To state it a different way, OLAs are created to ensure that core activities performed by different support teams are clearly aligned to meet the agreed-upon business SLAs. This is a fundamental piece of service-level management. As such, Exchange administrators should have enough knowledge to assist other support groups and know where the interdependencies are.
Windows Server 2003 introduced the Server Performance Advisor (SPA) to assist in identifying performance-related issues. Unfortunately, this tool is not compatible with Windows Server 2008 or Windows Server 2008 R2, but much of the SPA functionality was built directly into the Performance Monitor tool in Windows Server 2008. The addition of data collector sets allows for collections of performance counters and system traces that are all related to a specific purpose. Windows Server 2008 comes with a number of predefined data collector sets: Active Directory Diagnostics, System Diagnostics, System Performance, and LAN Diagnostics. Frequently when troubleshooting Active Directory, the first things to look at are search performance and LDAP performance. Using data collector sets, as shown in Figure 1, makes this easy to do.
Figure 1. Performance monitor data collector sets
After the data has been collected, reports can be generated. Some of the more interesting items are the LDAP client with the most CPU, and the searches with the most CPU. Figure 2 shows an example of a report.
You can quickly identify any clients consuming large amounts of the server’s resources. Additionally, the full search query terms are displayed. Custom applications frequently perform inefficient LDAP searches and might require the use of a network sniffer to troubleshoot. That typically requires involving groups outside of infrastructure, or deep knowledge of how to read and understand network traces. DNS information can be retrieved with the System Performance data collector set.
Keep in mind that the report only shows the symptoms of an issue, and the art of troubleshooting is working through a process to determine the root cause.
Figure 2. Performance Monitor report
You may notice that no pre-built data collector sets are specific to Exchange Server. The Exchange team posted role-based XML files that you can use to create a user-defined data collector set. The XML files are at http://blogs.technet.com/mikelag/archive/2010/01/11/perfwiz-for-exchange-2010.aspx. XML files are currently available for the Mailbox role, along with one file that contains all counters.
Notes From The Field: Creating a Report of Performance Data
Senior Support Engineer, Microsoft Corporation, USA
When you need to create a report from a performance log file, there is a very useful tool called PAL (available at http://pal.codeplex.com/Wikipage) that enables you to quickly create a report full of information that can speed up the process of discovering a performance issue and help you with performance troubleshooting. You can also use it to help determine your own thresholds for a specific environment: the numbers are just a suggestion that works for the majority of scenarios, but might not apply to your particular case. Analyze the report carefully and trace issues back to the Performance Monitor numbers.
Always remember to stop performance monitoring collection after you have gathered the data you require, because the data collection itself may affect the performance of normal operations.
These data collector sets make it easy to capture key performance counters, but it is important to know which performance aspects are the most important. There is a tension between running servers at maximum utilization and crossing the threshold at which server performance suffers. Underutilized servers cost money (power, hardware, and options), and can possibly be consolidated. Knowing the key indicators can help an administrator determine when more capacity is needed or when excess capacity can be reduced.
5.1. Processor
The processor is one of the core components that need monitoring to ensure server health. Standard counters include the total percentage of processor time, the percentage of user-mode processor time, and the percentage of privilege-mode processor time. Another key counter is the processor queue length. If this length is outside of the operating threshold, the processor may have more work than it can handle. This may be a good indicator that a server could use a faster processor or additional cores.
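Because a counter such as processor queue length can spike briefly without indicating a real problem, one common approach is to alert only when the value stays above the operating threshold for several consecutive samples. The following Python sketch shows that idea; the threshold of 10, the run length of 3, and the sample values are hypothetical and are not recommended values from this text.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if at least min_consecutive samples in a row exceed threshold."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    run = 0
    for value in samples:
        # Extend the current run on a breach; reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical processor queue length samples (illustrative numbers only).
queue_length = [1, 2, 12, 3, 11, 13, 14, 2]
print(sustained_breach(queue_length, threshold=10, min_consecutive=3))
```

Requiring a sustained breach trades a slightly slower alert for fewer false alarms, which is usually a reasonable trade for counters that fluctuate from sample to sample.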
5.2. Memory
Exchange 2010 benefits from the 64-bit processor’s ability to address large amounts of memory efficiently. Monitoring the page file can reveal a performance impact if the server has to constantly swap pages to slower disk storage. Increasing memory or reducing the server’s load are ways to improve performance.
5.3. Active Directory Access
Exchange relies heavily on Active Directory for everything from configuration information to user property information. Therefore, it is critical to monitor response times for searches and reads. Ensure that you have a good ratio of Exchange server cores to Active Directory cores. If the Global Catalog servers are running on a 32-bit platform, the recommended ratio of Active Directory server cores to Exchange Mailbox Server cores is 1:4. If the Global Catalog servers are running on a 64-bit platform, the recommended ratio of Active Directory server cores to Exchange Mailbox Server cores is 1:8 (assuming the Active Directory server has enough memory to cache the entire Active Directory database in memory). Upgrading Active Directory servers to 64-bit may be a way to get more performance without scaling out with new hardware.
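The core ratios above translate into simple arithmetic. The following Python sketch computes the minimum number of Global Catalog cores for a given number of Mailbox server cores, using the 1:4 and 1:8 ratios stated in this section. The function name and the choice to round up to a whole core are assumptions made for illustration, and the 64-bit figure only holds under the memory caveat already noted.

```python
from math import ceil

# Mailbox server cores supported per Active Directory (Global Catalog) core,
# taken from the ratios in the text: 1:4 on 32-bit, 1:8 on 64-bit.
MAILBOX_CORES_PER_GC_CORE = {"32-bit": 4, "64-bit": 8}


def min_gc_cores(mailbox_cores, gc_platform):
    """Minimum Global Catalog cores for the given Mailbox cores (rounded up)."""
    if mailbox_cores < 0:
        raise ValueError("mailbox_cores must not be negative")
    return ceil(mailbox_cores / MAILBOX_CORES_PER_GC_CORE[gc_platform])


# Hypothetical site with 32 Mailbox server cores.
print(min_gc_cores(32, "32-bit"))  # 8 cores at 1:4
print(min_gc_cores(32, "64-bit"))  # 4 cores at 1:8
```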
5.4. Disk Storage
Finally, Exchange is a very demanding database application that requires consistent and fast disk access. When I/O counters fall outside of normal range, client performance is directly impacted. For example, Outlook and Outlook Web App (OWA) users will report “Outlook is slow” when opening mail or moving between folders. Faster disks, more disks, or redesigning existing disk arrays are all ways to mitigate storage issues.