Troubleshooting Exchange Server
Despite our care and attention, despite our best efforts to design the perfect Exchange server environment, something will inevitably go wrong at some point. Whether it's an unintended configuration setting, faulty hardware, a change to a dependency, or (gasp) a bug in the product, something invariably happens to cause problems for end users and ultimately for us, the administrators.
So what do you do when the lights go out on the Exchange server, figuratively speaking? The goal of this tutorial is to outline tried-and-true strategies for recovering an Exchange server as quickly as possible.
In this article, you will learn to:
- Narrow the scope of an Exchange server problem
- Use basic Exchange Server troubleshooting tools
- Troubleshoot Mailbox server problems
- Troubleshoot mail transport problems
Basic Troubleshooting Principles
We can't overemphasize this key point: to troubleshoot Exchange Server, you have to understand the architecture. Understanding which functions of Exchange Server are controlled by which server roles is absolutely critical, or else you could spend a lot of time troubleshooting the wrong server.
Troubleshooting Exchange Server 2013 often involves collecting and reviewing information from a series of servers, rather than focusing on one. For example, a user complains that he isn't receiving new email. There are a number of possible causes for this:
- The user's client isn't receiving notifications of new email.
- The user's client can't connect to the Client Access server to retrieve new email.
- All copies of the relevant mailbox database are offline.
- The user's mailbox is full.
- Transport agents preclude delivery of email to this end user.
A closer look at this list shows an interesting breakdown. The first two issues could loosely be categorized as client-access issues, the next two as database issues, and the last as a transport issue. Unfortunately, this no longer corresponds nicely to the Exchange server roles since all those functions have now been rolled into only two roles. We'll cover troubleshooting in this tutorial first by covering the general troubleshooting tools and then by troubleshooting client access, database storage, and then mail flow issues. However, before we dive right into the tools, let's take a moment to consider what troubleshooting involves.
When faced with a technical problem, your immediate impulse is often to jump right into the system and start clicking. While this can be successful, particularly when you're resolving a problem you've seen hundreds of times and know like the back of your own hand, it's not necessarily a reproducible strategy. What happens when you encounter a problem you haven't seen before? What do you do when you truly have no idea what the root cause could be?
The first step in troubleshooting a problem, any problem, is to define what the problem is. In many cases, this requires asking for more information. When an end user says that she can't send email, does she mean that she can't open Outlook? That she can't generate a new email? That she clicks Send but the email never leaves the Drafts or Outbox folder? Or that she's sent messages that were never received? The end result is the same (the user can't send email), but the root causes are very different.
Once the problem has been defined, the next step is to determine the scope of the problem. This often helps clarify the direction of further troubleshooting. By determining how many users are affected and, more importantly, what those users have in common, you can rule out some possibilities and focus on things with a greater impact. For example, if one user can't send email, the root cause could be many things unique to that user, from Outlook configuration to network connectivity to a disabled user account.
However, if a second user has a similar issue, it's more likely to be something they have in common. Are they in the same network segment, perhaps? If 10 users on different floors all report Outlook problems, there may be a problem on an Exchange server. Are all 10 users in the same database, for example, or in the same Active Directory site?
There are a number of clarifying questions that are extremely useful in determining the scope of a particular problem:
- How many users are affected by the outage?
- Do all the affected users access Exchange Server through the same method, such as Outlook, Outlook Web App, or ActiveSync?
- What exactly are the users trying to do when they encounter the problem?
- Are other users able to perform the same task without problems?
- Are all of the users in the same database?
- Are all of the users in the same site?
- Does the problem occur all the time, only some of the time, or rarely?
The answers to these will often rule out possibilities right from the start. If one user can't log into Outlook successfully, but another in the same database can, you know immediately that the relevant database must be mounted and accessible, and you can then concentrate on other things.
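One quick way to answer the "same database" question is to look up the affected mailboxes and count them per database. What follows is only a minimal sketch, assuming it runs in the Exchange Management Shell where Get-Mailbox is available; the function name Get-AffectedMailboxSummary and the sample addresses are hypothetical.

```powershell
# Sketch: summarize which mailbox databases a set of affected users share.
# Assumes the Exchange Management Shell; the function name is hypothetical.
function Get-AffectedMailboxSummary {
    param(
        [Parameter(Mandatory = $true)]
        [string[]]$Identity
    )

    # Look up each reported mailbox; skip any identity that cannot be resolved.
    $mailboxes = foreach ($id in $Identity) {
        Get-Mailbox -Identity $id -ErrorAction SilentlyContinue
    }

    # Count mailboxes per database, most common database first.
    $mailboxes |
        Group-Object -Property Database |
        Sort-Object -Property Count -Descending |
        Select-Object -Property Name, Count
}

# Example call (hypothetical users):
# Get-AffectedMailboxSummary -Identity 'alice@contoso.com', 'bob@contoso.com'
```

If every affected mailbox lands in a single database, that database is a sensible place to start. An even spread across databases suggests looking at something else the users share, such as the site or the client access path.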
Speaking of concentrating on other things, one of the most difficult things in troubleshooting is ignoring the unimportant distractions and focusing on what's causing the issue. It's often difficult to differentiate between what's important and what's not unless you know where to start (which is why defining the problem is so important).
Here's an example: an end user reports that he can't send email to a specific user, and during investigation you also discover that he can't access a particular public folder. Is the public folder problem directly related to the email problem? It might be: if the recipient's mailbox is on a server that also houses the only instance of that public folder, and that server is inaccessible, that would explain both problems. But in many cases it might not: the mailbox database that contains the public folder store might be dismounted or the user might not have permissions. Although there's at least one explanation that covers both problems, many more exist that are unique to the secondary problem. The steps to troubleshoot internal mail flow are different from those required to troubleshoot public folder access, so if you're trying to resolve a problem with internal email, concentrate on that and leave the public folder issue for later. Essentially, isolate the issue and start investigating it. Divide and conquer.
General Server Troubleshooting Tools
During troubleshooting, some steps should be the same no matter what the symptoms are. Yes, you need to define the problem, as discussed earlier, and you also need to understand the scope of the issue. But once you've determined that the problem is indeed server-based rather than specific to a group of clients, what next? This section will focus on the key tools you should use first.
Event Viewer (Diagnostic Logging)
Troubleshooting a server involves data collection and analysis, and the best ways to collect that data are the same regardless of server role. The Event Viewer includes detailed information about recent system and application errors, and this should always be an administrator's first move in the event of a crisis.
Windows Server 2012 servers have two categories of event logs: Windows logs and Applications and Services logs. The Windows logs contain the event logs available in earlier versions of Windows (the Application, Security, and System event logs) as well as two logs introduced in Windows Server 2008: the Setup log and the ForwardedEvents log.
Windows logs store events from legacy applications and events that apply to the entire system. Applications and Services logs store events from a single application, such as Exchange Server, or components, such as a specific service, rather than events that might have system-wide impact.
Once you've determined the scope of a problem, and you've positively identified the root cause as server related, your next step should be to check the event logs on the relevant system. Because Exchange Server has so many moving parts, so to speak, you'll often find a large number of events clustered together at the time of the reported issue. The default logging level for the majority of services and categories is Lowest, which means that only critical, error, and warning events with a logging level of 0 are written to the event log.
If the events generated during the problem aren't quite enough, you might need to increase the logging level for a specific service and category (for example, MSExchange Transport\Mail Submission) to Low, Medium, or High. There is another logging level, Expert, but this generates so many events that it should be used only for short periods, typically when working directly with Microsoft support.
As with nearly everything in Exchange Server 2013, you can configure diagnostic logging through either the Exchange Admin Center (EAC) or the Exchange Management Shell (EMS).
In the initial release of Exchange Server 2007, diagnostic logging was removed from the Exchange Management Console (EMC), and the only way to increase logging for a particular service was the Set-EventLogLevel cmdlet. Because PowerShell was still new at the time (Exchange Server 2007 was many administrators' first exposure to it), the change wasn't well received, so Microsoft reintroduced diagnostic logging control to the console in Service Pack 2. Diagnostic logging control is still an administrator favorite in Exchange Server 2013.
If you run through the installation process of Exchange Server 2013, you will soon realize that logging is a major consideration from the outset. There have been many architectural and operational changes in Exchange Server 2013, but one that receives little fanfare is the minimum space requirement change for the installation partition of your Exchange servers. At minimum, you must have 30 GB of available space on the drive where you install Exchange Server, and I would recommend much more than that. The majority of this space will be filled by log files-not the database transaction log files that you have learned to love and respect but the diagnostic and performance log files you dread to dig through.
These log files are written to a default logging directory on the installation drive. Logging is enabled by default on all Exchange servers and cannot be disabled, at least in any way that I have found. Microsoft recommends that you open a call to product support should your entire installation drive become full and cause issues with your Exchange server.
The way to configure diagnostic logging in Exchange Server 2013 is through the Set-EventLogLevel cmdlet. This cmdlet does not take a server parameter. In other words, you have to run the command from the shell on the target server to configure logging. The syntax is relatively straightforward:
Set-EventLogLevel -Identity "MSExchange Transport\Mail Submission" -Level Medium
It's always a good idea to reset the logging back to Lowest when you're finished troubleshooting. Increased logging can add significantly to event log growth, and depending on your settings it might fill up your event log quickly or overwrite events.
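To make that cleanup harder to forget, it can help to list the categories that are currently above Lowest before you walk away from the server. This is a rough sketch, assuming the Exchange Management Shell on the server in question; Get-RaisedEventLogLevel is a hypothetical helper name, while Get-EventLogLevel and Set-EventLogLevel are the built-in cmdlets.

```powershell
# Sketch: list diagnostic logging categories that are not at the Lowest level.
# Assumes the Exchange Management Shell; the function name is hypothetical.
function Get-RaisedEventLogLevel {
    Get-EventLogLevel |
        # Compare as text so the check works whether EventLevel is an enum or a string.
        Where-Object { "$($_.EventLevel)" -ne 'Lowest' } |
        Select-Object -Property Identity, EventLevel
}

# Typical use once troubleshooting is finished:
# Set-EventLogLevel -Identity "MSExchange Transport\Mail Submission" -Level Lowest
# Get-RaisedEventLogLevel
```

Because only the majority of categories default to Lowest, treat the output as a list to review rather than a list to reset blindly. Some entries may simply be at their normal default.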
Once you've identified the target server and configured logging, you might not see relevant events right away. You may need to reproduce the issue (for example, by having the user send another email or attempt to force a connection for a mail queue) before Exchange Server logs anything of value. Exchange Server events themselves will always appear in the Application event log and in the logging directories.
Diagnostic events include a wealth of information, but the most important pieces are the following:
- Description: Although the field is unnamed in Windows Server 2012, it's the equivalent of the legacy Description field from previous versions of Windows. This includes the text of the event and will in many cases include additional error codes or critical information. For example, the well-known and widely feared -1018 error isn't an event; it's a JET error code that appears within the description text of other ESE events, like ESE error 474. The description may also include a link to further information on the Microsoft support site.
- Source: This tells you which component logged the event. Note that this will typically be the underlying service name rather than the "friendly" name.
- Event ID: This is the specific event number. Along with the Source, this is the most important information for the event.
- Level: This reflects the severity of the event and can range from Informational to Error.
- Logged: This displays the date and time of the event in local time. This information is stored in the event in UTC, and the Event Viewer displays the equivalent local time. If you're looking at a remote server, make sure you take this into account!
- Task Category: This is the subcomponent of the service that logged the event. Not all services provide this additional information, but the majority of Exchange Server services do. This corresponds to the categories visible in the Manage Diagnostic Logging Properties Wizard or via Set-EventLogLevel.
Depending on the error, you should see some or all of this information in the event.
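The same fields can be pulled from the shell, which is handy when you want to line up events from several servers. The sketch below uses the standard Get-WinEvent cmdlet; the function name Get-RecentExchangeEvent, the one-hour default window, and the MSExchange* source pattern are assumptions chosen for illustration. Storage engine events are logged under the ESE source, so widen the pattern if you need those.

```powershell
# Sketch: pull recent warning-or-worse events from the Application log and
# keep only those whose source matches a pattern. The function name and the
# default values are hypothetical.
function Get-RecentExchangeEvent {
    param(
        [int]$Hours = 1,
        [string]$ProviderPattern = 'MSExchange*'
    )

    $filter = @{
        LogName   = 'Application'
        Level     = 1, 2, 3                      # Critical, Error, Warning
        StartTime = (Get-Date).AddHours(-$Hours)
    }

    # Get-WinEvent reports an error when nothing matches, so suppress it.
    Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
        Where-Object { $_.ProviderName -like $ProviderPattern } |
        Select-Object -Property TimeCreated,
            @{ Name = 'TimeCreatedUtc'; Expression = { $_.TimeCreated.ToUniversalTime() } },
            ProviderName, Id, LevelDisplayName, Message
}

# Example call: events from the last four hours
# Get-RecentExchangeEvent -Hours 4 | Format-List
```

The extra TimeCreatedUtc column is there for the time zone issue described under Logged: comparing UTC values avoids confusion when the servers or the administrator sit in different time zones.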
Many Exchange Server events include detailed diagnostic steps in the Description field, which is extremely convenient in times of trouble. Even if the event doesn't provide much information, you might be able to find more on the TechNet Events and Errors Message Center at www.microsoft.com/technet/support/ee/ee_advanced.aspx. Unfortunately, at the time of this writing this site hasn't been updated for Exchange Server 2013; thankfully, we still find relevant information regarding event IDs that have not changed since previous versions of Exchange Server. Simply select the appropriate product (Exchange Server, obviously); select the appropriate version (15.0 for Exchange Server 2013); enter the event ID, source, or both; and then click Go. Assuming the event appears in the TechNet database, you should see a link for additional information, which then provides a detailed explanation of the issue, as well as troubleshooting steps and recommendations. If you can't find information on the specific event here, there's always the Microsoft Knowledge Base (http://support.microsoft.com/search/?adv=1) or your favorite search engine.
The Test-* Cmdlets
PowerShell cmdlets control so much functionality in Exchange Server 2013 that it's not a surprise to see troubleshooting cmdlets as well. The Test-* cmdlets in Exchange Server are solid tools in the back pockets of Exchange Server administrators. My recommendation is to use them frequently since they use few resources on the servers and provide a wealth of useful information. For a complete list of all Test-* cmdlets, and a brief description of each cmdlet, just type the following command in an Exchange Management Shell window:
Help Test-*
Test-SystemHealth
One of the most basic troubleshooting cmdlets is Test-SystemHealth, a handy little tool that quickly collects data about the local server and analyzes it according to Microsoft-recommended practices. The standard syntax is mercifully simple: type Test-SystemHealth, press Enter, and then wait for the output. Unlike many cmdlets, Test-SystemHealth generates a progress bar at the top of the EMS window. This is a useful visual indicator; it's high contrast, so you can see it from several feet away. When the cmdlet finishes, it displays the results in a simple list format (which you could format with the Format-List cmdlet if you wanted to). The tool displays warnings in yellow and alerts in red.
The resulting data is a mini-health check for your server. The Test-SystemHealth cmdlet will alert you to many common misconfigurations as well as recommended settings.
Test-ServiceHealth
Another extremely useful cmdlet is Test-ServiceHealth, which does what its name suggests: it checks the health of all required Exchange Server services on the server. Since the cmdlet recognizes roles as well, it doesn't check for every service; it only looks for the services the installed roles use. For example, if you're running the test on a Client Access server, it will not check for the MSExchangeMailSubmission service, which is available only on a Mailbox server.
This cmdlet also uses a very simple syntax; just type Test-ServiceHealth, press Enter, and peruse the results. The output from this cmdlet is preformatted into a table and simply reports on the status of the required services; an example of the output is shown.
If you want to quickly check the status of a single server, the two preceding cmdlets can save a lot of time and effort. However, neither cmdlet runs against multiple servers at once. To check the configuration of a group of servers, or even every server in the organization, you need to script the cmdlets to run at a higher level.
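As a starting point for that kind of script, the following sketch loops over the servers returned by Get-ExchangeServer and runs Test-ServiceHealth against each one through its -Server parameter. The function name Get-ExchangeServiceHealthReport is hypothetical, and the sketch assumes every returned server can be reached from the shell where it runs; in a real organization you may need to filter the server list first.

```powershell
# Sketch: run Test-ServiceHealth against every Exchange server and flatten
# the results into one report. Assumes the Exchange Management Shell; the
# function name is hypothetical.
function Get-ExchangeServiceHealthReport {
    foreach ($server in (Get-ExchangeServer)) {
        # Test-ServiceHealth returns one result per installed role.
        foreach ($result in (Test-ServiceHealth -Server $server.Name)) {
            [pscustomobject]@{
                Server                  = $server.Name
                Role                    = $result.Role
                RequiredServicesRunning = $result.RequiredServicesRunning
                ServicesNotRunning      = @($result.ServicesNotRunning) -join ', '
            }
        }
    }
}

# Example call: show only roles with a required service that is not running
# Get-ExchangeServiceHealthReport | Where-Object { -not $_.RequiredServicesRunning }
```

Flattening the output into one object per server and role makes it easy to sort, filter, or export the results, which matters more as the number of servers grows.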
New Cmdlets
Each version of Exchange Server since Exchange Server 2007 brings along a new stable of cmdlets. There are three new Test-* cmdlets in Exchange Server 2013, bringing the total to 32: Test-OAuthConnectivity tests application authentication, Test-SiteMailbox tests connectivity to a SharePoint 2013 site mailbox, and Test-MigrationServerAvailability tests connectivity to a migration server during a move of mailboxes to Office 365.