Practical Problem Management

For many organisations Problem Management is something of a poor relation to corporate Service Desk and Incident Management activity. Whilst the Service Desk and Incident Management processes are adequately staffed, defined, and operated, Problem Management is often ‘something to be done when time allows’. Consequently, Problem Management may never rise to the top of the IT Service Management (ITSM) ‘to do list’. Even if an organisation undertakes reactive Problem Management in response to, say, a Major Incident, proactive Problem Management (where dedicated resource actively identifies problems) may never receive due attention. This is a missed opportunity: of the major ITIL processes, effective Problem Management activity can provide the highest return to an organisation.

One possible cause of corporate inattention is that problems are often confused with incidents (with the terminology interchanged wrongly), or are seen as an incident state rather than a separate entity requiring a different type of ITSM response. The ITIL definition of a ‘problem’ is the cause of one or more incidents, with Problem Management the process of managing all problems throughout their lifecycle. The process’s primary objective is to prevent problems and resulting incidents from occurring, to eliminate recurring incidents, and to minimise the impact of incidents that cannot be resolved (referred to as known errors within ITIL). In order to achieve this goal, Problem Management must get to the root cause of incidents and then initiate actions to improve or correct the situation.

A practical way to differentiate between problems and incidents is to view an incident as a symptom that is resolved when normal operation (from a customer perspective) is restored. A problem, by contrast, is the root or underlying cause of one or more incidents, and is resolved only when that cause is permanently rectified. In ITIL v2, Problem Management was part of Service Support; in v3, it is part of Service Operation. Because the two processes have conflicting priorities, Butler Group recommends that an organisation treats Incident and Problem Management as separate processes with different process managers.

Problems can be identified just about anywhere within the IT ecosystem: acceptance into production, changes, updates/patches, vendor products, user errors, production execution, and failures. However, the main source for problem identification within an organisation will probably be the analysis of incidents as part of the proactive Problem Management process. Problem management resource should regularly analyse incident and problem data, identifying trends and reporting them, along with problem management metrics and success stories, both inside and outside the IT function.
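As an illustration of the incident analysis described above, the following is a minimal sketch of how recurring incidents might be surfaced as problem candidates. The record fields (`category`, `service`), the sample data, and the threshold of three incidents are assumptions made for the example, not part of ITIL or of any particular ITSM tool.

```python
from collections import Counter


def recurring_incident_groups(incidents, min_count=3):
    """Return (category, service) groups that occur at least min_count times.

    Groups are returned most frequent first. The grouping key and the
    threshold are illustrative assumptions, not an ITIL rule.
    """
    counts = Counter((i["category"], i["service"]) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Hypothetical incident records exported from a service desk tool.
incidents = [
    {"id": "INC001", "category": "email", "service": "Exchange"},
    {"id": "INC002", "category": "print", "service": "PrintServer"},
    {"id": "INC003", "category": "email", "service": "Exchange"},
    {"id": "INC004", "category": "vpn", "service": "RemoteAccess"},
    {"id": "INC005", "category": "email", "service": "Exchange"},
    {"id": "INC006", "category": "print", "service": "PrintServer"},
]

candidates = recurring_incident_groups(incidents, min_count=3)
for (category, service), n in candidates:
    print(f"Problem candidate: {category}/{service} ({n} incidents)")
```

In practice the grouping key would more likely be a configuration item or a symptom code, and the threshold would be tuned to the organisation's incident volumes.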

For effective Problem Management, the problem management team must provide problem control, which is a process of problem identification and recording, problem classification, problem investigation and diagnosis, and actively tracking problem status through to resolution or known problem/error status. Unfortunately, many organisations fall down at this first hurdle, with IT support attention so focused on incident resolution that it is in a perpetual state of ‘bailing water out of an overflowing bath’ rather than looking to ‘turn off the taps’.

IT management needs to appreciate that far too much costly, and possibly scarce, IT resource is spent fighting repetitive fires and that this resource would be better utilised supporting problem management personnel in tackling the root causes, rather than the symptoms, of IT failures. Outside of the IT function, the business impact of problems, in the form of recurring incidents, can be considerable in terms of lost user productivity or, more importantly, the financial implications of outages to critical business services, and degradation of both customer perception and the reputation of the corporate brand.

For some IT organisations, providing resource for Problem Management activities may be easier said than done. Rather perversely, IT needs to undertake some initial Problem Management activity in order to justify longer-term activity. The analysis of Incident Management data and conversations with key customers will help identify a sample of problems that will allow IT to establish the business-wide costs associated with recurring incidents. When compared to the costs associated with undertaking Problem Management activities, the organisation is able to isolate the potential benefits to be realised through the proactive resolution of problems. In the current financial climate it may be necessary for IT organisations to use this information to justify a short-term proof of concept project that can formally demonstrate benefit delivery.
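The cost comparison described above can be sketched as simple arithmetic. The function below is a hypothetical illustration: the parameter names, the assumption that a fixed fraction of recurring incidents is eliminated, and all of the figures are invented for the example and would need to be replaced with the organisation's own incident data.

```python
def problem_management_case(incidents_per_year, cost_per_incident,
                            investigation_cost, expected_reduction):
    """Compare the yearly cost of a recurring incident with the cost of fixing its cause.

    expected_reduction is the assumed fraction (0.0 to 1.0) of recurring
    incidents that resolving the problem would eliminate.
    """
    recurring_cost = incidents_per_year * cost_per_incident
    avoided_cost = recurring_cost * expected_reduction
    return {
        "recurring_cost": recurring_cost,
        "avoided_cost": avoided_cost,
        "net_benefit": avoided_cost - investigation_cost,
    }


# Illustrative figures only: 120 incidents a year at 250 each, an
# investigation and fix costing 8,000, and an assumed 75% reduction.
case = problem_management_case(120, 250.0, 8000.0, 0.75)
print(case)
```

A positive `net_benefit` across a small sample of problems is the kind of evidence that could support a proof of concept project; a real business case would also need to account for user productivity and outage costs outside IT.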

Importantly, organisations shouldn’t try to do too much too soon. Resource should be focused on a prioritised set of initial problems with activity ramped up as successes are achieved. IT also needs an initial Problem Management strategy that is focused on planning, and delivering, services that are closely aligned to business requirements. This business-oriented approach will help keep IT grounded in the real reasons for Problem Management and away from a perpetual state of ‘analysis paralysis’.

When justifying the adoption of Problem Management to the business, IT should also make it clear that the process works with existing Incident and Change Management processes to ensure that IT service availability and quality are increased. It should explain that, over time, Problem Management will identify permanent solutions and reduce the number and resolution time of incidents, resulting in less downtime and disruption to business-critical services, reduced expenditure on ineffective workarounds or fixes, and a reduction in the cost of, and effort in, fire-fighting or resolving repeat incidents.

Once an organisation has corporate commitment to problem management resource, it must ensure that it has cradle-to-grave processes formalised right across the problem lifecycle. The operation of the aforementioned problem control process will result in one of three outcomes: a change is required to correct the problem; the problem cannot be fixed but a workaround has been identified; or no fix or workaround has been identified. In the first instance, the organisation should use an error control process to correct the problem via the corporate Change Management process. In the second instance, the problem is classified as a known error, with the workaround (a temporary way of resolving the incident) logged in a known error database and made available to all support teams for ongoing incident resolution activity. Finally, where a problem has been investigated but no solution or workaround identified, it is recorded as a known problem, with the information again made available for the benefit of all support teams. It is important to recognise that these three problem states are not mutually exclusive and that a problem may move between them over time. For instance, when possible, a workaround should still be made available whilst a problem is awaiting the implementation of a required change.
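The three outcomes of problem control can be modelled as a small set of states. The sketch below is one possible interpretation, not an ITIL-defined data model: the `ProblemOutcome` names and the `classify` function are hypothetical, and a set is returned because, as noted above, the states are not mutually exclusive.

```python
from enum import Enum


class ProblemOutcome(Enum):
    CHANGE_REQUIRED = "change required (handled via error control and Change Management)"
    KNOWN_ERROR = "known error with workaround (logged in the known error database)"
    KNOWN_PROBLEM = "known problem (no fix or workaround identified yet)"


def classify(fix_identified: bool, workaround_identified: bool) -> set:
    """Map the results of problem investigation onto outcome states.

    A problem awaiting a change can still carry a workaround, so both
    states may apply at the same time.
    """
    outcomes = set()
    if fix_identified:
        outcomes.add(ProblemOutcome.CHANGE_REQUIRED)
    if workaround_identified:
        outcomes.add(ProblemOutcome.KNOWN_ERROR)
    if not outcomes:
        outcomes.add(ProblemOutcome.KNOWN_PROBLEM)
    return outcomes


# A fix exists and a workaround is available while the change is pending.
for outcome in sorted(classify(True, True), key=lambda o: o.name):
    print(outcome.name, "->", outcome.value)
```

Re-running `classify` as investigation progresses reflects the way a problem moves between states over time, for example from known problem to known error once a workaround is found.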

There is a multitude of problem management tools and techniques available to facilitate problem resolution. These include Kepner-Tregoe, Ishikawa diagrams, Component Failure Impact Analysis, Fault Isolation, Technical Observation Post, Fault Tree Analysis, problem brainstorming, CRAMM, trend analysis, chronological analysis, and Pain Value Analysis. Kepner-Tregoe is a useful method of problem analysis for formally investigating deeper-rooted problems and includes the following high-level stages: define the problem; describe the problem in terms of identity, location, time and size; establish possible causes; test the most probable cause; and verify the true cause. Ishikawa diagrams document the causes and effects of one or more incidents. Typically produced in a brainstorming session, the diagram represents the problem as the trunk of a ‘tree diagram’, with causes shown as primary and then secondary branches. Creating the diagram stimulates discussion and often leads to increased understanding of complex problems.
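The trunk-and-branch structure of an Ishikawa diagram maps naturally onto a nested data structure. The following is a minimal sketch that renders one as indented text; the `render_ishikawa` function, the cause categories, and the example problem are all invented for illustration.

```python
def render_ishikawa(problem, causes):
    """Render a problem (trunk) with primary and secondary causes (branches).

    causes maps each primary cause to a list of secondary causes.
    """
    lines = [problem]
    for primary, secondaries in causes.items():
        lines.append(f"  {primary}")
        for secondary in secondaries:
            lines.append(f"    {secondary}")
    return "\n".join(lines)


# Hypothetical output of a brainstorming session.
diagram = render_ishikawa(
    "Recurring email outage",
    {
        "People": ["Unclear escalation path"],
        "Technology": ["Disk full on mail server", "Unpatched software"],
    },
)
print(diagram)
```

Capturing the diagram as data rather than only as a drawing makes it easier to attach it to the problem record and to revisit the candidate causes as they are tested.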

As with all processes, the efficient and effective operation of Problem Management requires fit-for-purpose performance metrics. For example, critical success factors include visibly improved service quality, problem cost and productivity impact minimisation, and a reduction in the cost of problem management. These should be supported by a basket of appropriate KPIs. Trend-based, percentage reduction measures might cover the average time to resolve problems, the average time to implement fixes to known errors, the average number of undiagnosed problems, the number of repeat incidents, the number of incidents and problems affecting business-critical services, and the average cost of handling a problem. Qualitative measures might include improved customer satisfaction responses relative to business disruption caused by incidents and problems. Volume-based measures might include the total number of problems recorded in the period and the number of problems not resolved within SLA targets. All metrics should be reported by problem category, impact, and priority level, and trended over time.
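A trend-based, percentage reduction KPI of the kind listed above can be computed as follows. This is a sketch under stated assumptions: the function name, the choice of days as the unit, and the sample figures are illustrative, and a real report would break the result down by problem category, impact, and priority.

```python
from statistics import mean


def percentage_reduction(previous, current):
    """Percentage reduction in the average value between two reporting periods.

    A positive result means the average fell (an improvement for
    time-to-resolve measures); a negative result means it rose.
    """
    previous_avg = mean(previous)
    current_avg = mean(current)
    if previous_avg == 0:
        raise ValueError("previous period average is zero; reduction is undefined")
    return (previous_avg - current_avg) / previous_avg * 100


# Hypothetical days-to-resolve for problems closed in two consecutive quarters.
last_quarter = [10, 14, 12]
this_quarter = [8, 10, 9]

reduction = percentage_reduction(last_quarter, this_quarter)
print(f"Average time to resolve problems fell by {reduction:.1f}%")
```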

Problem Management should have strong relationships with other key ITIL processes. In addition to its linkages with Incident and Change Management, it needs to use Configuration Management data to help determine the impact of problems and resolutions. Availability Management has a dependency on Problem Management information and activity, and some problems will require investigation by Capacity Management teams and techniques. Problem Management can also be an entry point into IT Service Continuity and Major Incident Management, where a significant problem is not resolved before it starts to have a major impact on the business. From a Service Level Management perspective, Problem Management contributes to improvements in service levels, and its management information should be used as the basis for SLA review activity.

To conclude, Problem Management is critical to efficient and effective IT operations and an organisation should seriously look at why it hasn’t formally adopted Problem Management policies, processes, and techniques. The hurdle probably isn’t the process itself, given that it is relatively straightforward when compared to most ITIL processes. It is more likely that suitable resource has never been justified for an activity that can be perceived as distant from day-to-day IT operations. Whilst the justification of dedicated problem management resource may at first appear daunting, the same was probably once true for the provision of Service Desk facilities. The analysis of corporate incident management data should support the need for the proactive management of recurring incidents within most organisations. Go on, do some digging; you might just make IT’s and your customers’ lives a lot easier.

Republished from

