Why You Should Design Your CMMS for Reliability Engineering

Answered October 30 2020

I can’t think of a better objective than reliability and availability. Every computerized maintenance management system (CMMS) has powerful features, but the magic is in the core team’s vision for operational excellence. Unfortunately, the CMMS community in general does not understand how to maximize value by focusing on the things that matter. In this piece, I’ll show how the core team and reliability team can work together to leverage failure data in the CMMS to make better decisions.

Figure 1: Statistic to explain the state of things

RCM Analysis Helps to Identify Accurate Maintenance Strategies

Reliability-centered maintenance (RCM) analysis provides a structured framework for analyzing the functions and potential failures of a physical asset, with a focus on preserving those functions. The RCM method documents the asset’s functions, functional failures, failure modes, failure effects, consequences, and proactive (or default) actions. This analysis provides the best way to determine the proper maintenance strategies for critical assets and systems. The failure mode is what leads you to the correct mitigating task. Each failure mode identifies the component, the component problem, and the cause of the failure, e.g., fuel pump motor bearing seized due to lack of lubrication.


Figure 2: RCM analysis example

Parsing the Failure Mode

Figure 2 shows multiple functions for a given asset, where each function can fail in multiple ways. Each functional failure can in turn have multiple failure modes, shown as a phrase containing the failed component and the component problem. Note that if the cause had been known, it would also have been part of the failure mode phrase. Parsing this phrase into three separate, validated fields (failed component, component problem, and cause) enhances failure analysis within the CMMS.


Figure 3: Parse the failure mode
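
The three-field split can be sketched as a small record type. This is only an illustration: the class name, the field names, and the phrase() helper are my own assumptions, not part of any CMMS product, and the sample values come from the fuel pump example earlier in this piece.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureMode:
    """One RCM-style failure mode stored as three separate fields (hypothetical names)."""
    component: str                # the failed component (the noun)
    problem: str                  # the component problem (the verb)
    cause: Optional[str] = None   # left empty when the cause is not known

    def phrase(self) -> str:
        """Rebuild the readable phrase as it would appear on an RCM worksheet."""
        text = f"{self.component} {self.problem}"
        if self.cause:
            text += f" due to {self.cause}"
        return text


# The example phrase from the RCM section, split into three fields.
mode = FailureMode(
    component="fuel pump motor bearing",
    problem="seized",
    cause="lack of lubrication",
)
print(mode.phrase())
```

Keeping the pieces separate means each one can be validated against its own list, while the phrase can still be rebuilt for display.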

Store Results of RCM Analysis Directly Inside the CMMS

Assuming the CMMS can be configured, it makes perfect sense to add an RCM application to it. To set up and maintain a living program, the reliability team needs this data readily available so they can refine it as they go. The RCM facilitator often stores the output in an Excel spreadsheet, which is a stand-alone document and not easily updated. The CMMS offers data security, a familiar screen design, and the ability to join the RCM data to other tables such as asset, labor, and the work order failure mode. As new systems are analyzed, their data can be uploaded in bulk. As new failure modes are discovered, they can be electronically routed for review and approval, and then inserted into the new application.

Create a Defendable PM Library

Imagine having a defendable preventive maintenance (PM) record, which is linked back to the RCM failure mode. By linking one (or more) failure modes to one PM record, you can then verify the validity of the entire PM library. And during this initial review, you might discover PM records that are no longer valid.

Figure 4: Defendable PM library example
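
A minimal sketch of that validity check, assuming the link between PM records and RCM failure modes is available as a simple mapping. The PM numbers, failure mode identifiers, and function name are hypothetical.

```python
# Hypothetical link table: PM record -> RCM failure modes it is meant to mitigate.
pm_to_failure_modes = {
    "PM-1001": ["FM-001", "FM-002"],
    "PM-1002": ["FM-003"],
    "PM-1003": [],  # no failure mode on file, so nothing defends this PM
}


def undefended_pms(links):
    """Return PM records that are not linked to any RCM failure mode."""
    return sorted(pm for pm, modes in links.items() if not modes)


print(undefended_pms(pm_to_failure_modes))
```

Any PM that comes back from a check like this is a candidate for review: either the failure mode it addresses was never recorded, or the PM is no longer valid.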

Understanding the RCM Standard SAE-JA-1011/12

This standard identifies failure mode as the language of RCM. More importantly, it describes the failure mode as three separate elements. If asset management professionals are to understand RCM and the definition of a failure mode, this is where they should look. This standard states that a failure mode should consist of a noun and a verb. Further, if you want to apply the optimum maintenance strategy, you also need to know the cause. Every reliability team should understand this standard. More importantly, they should insist that the CMMS failure data have this level of granularity, so that they can run a bad actor report.

On RCM Blitz

RCM Blitz is a book by Douglas Plucknette that describes RCM analysis. It clearly defines the failure mode as three pieces, although they are grouped together as a phrase. Douglas may be the best in the business at performing RCM analysis and discovering system defects, but he never tried to store this data inside a CMMS. Upon reading his book, it became obvious to me that this was an opportunity worth investigating.

Figure 5: An RCM standard

CMMS Product Design

Unfortunately, most CMMS products do not capture a failure mode as three separate fields. Rather, they concentrate on asset problem codes. To make matters worse, they introduce a failure code hierarchy, which overcomplicates the design by segregating components under each failure class. On the surface, this sounds pretty clever. But if there were 50 total failure classes, 25 components, and 25 component problems, then there could be a total of 50 × 25 × 25 = 31,250 boxes in the hierarchy, and that’s not including the cause codes. This is the main reason so many organizations have never succeeded at establishing a failure data library. Without the failure data, they cannot run failure analytics using aggregate commands. The most damaging outcome of this design failure is that the reliability team cannot use a failure analytic in the CMMS to manage by exception.
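
The arithmetic behind that count can be checked in a few lines. The comparison with two independent lists is my own illustration of the alternative design, under the assumption that component and component problem are stored as separate flat domains.

```python
failure_classes = 50
components = 25
component_problems = 25

# Hierarchy: every failure class carries its own component/problem combinations.
hierarchy_boxes = failure_classes * components * component_problems

# Assumed alternative: one flat list of components plus one flat list of problems.
flat_values = components + component_problems

print(hierarchy_boxes, flat_values)
```

Under these assumptions the hierarchy has to be populated and maintained box by box, while the flat design only needs the two short lists.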

Configuring the Product

All that is required is to add fields to the work order tracking screen, build a failure analytic to extract bad actors, and apply a sort metric. The first step is to identify when the failure mode is required. The answer to that is to add a Yes/No field called Functional Failure. Thus, if this field is flagged Yes by the operator, then the maintenance staff must enter the failure mode at job completion.
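
The completion rule can be sketched as a simple validation function. The dictionary keys are hypothetical field names, and treating the cause as optional at completion is my assumption, since the cause is sometimes not known.

```python
def can_complete(work_order: dict) -> bool:
    """Allow job completion only when the required failure data is present."""
    if not work_order.get("functional_failure", False):
        return True  # Functional Failure = No, so no failure mode is required
    # Functional Failure = Yes: component and problem must both be filled in.
    required = ("failed_component", "component_problem")
    return all(work_order.get(field) for field in required)


print(can_complete({"functional_failure": True}))
print(can_complete({
    "functional_failure": True,
    "failed_component": "BEARING",
    "component_problem": "SEIZED",
}))
```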

The failure mode consists of three pieces: failed component, component problem, and cause code. The failed component can be found on the ITEM master table, commodity field. By running a query against the (historical) material issues table, you can use the commodity field linked with each item/part number. Then, by running an SQL distinct command, you have an instant list of failed components. At this point, you can make a static domain or a dynamic table-based domain. Note that this is for ALL assets.
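
As a rough sketch of that query, the example below builds two tiny in-memory tables with Python's sqlite3 module and runs a SELECT DISTINCT against them. The table names, column names, and part numbers are assumptions for illustration; a real CMMS schema will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (itemnum TEXT PRIMARY KEY, commodity TEXT);
    CREATE TABLE material_issue (wonum TEXT, itemnum TEXT);
    INSERT INTO item VALUES ('P-100', 'BEARING'), ('P-101', 'SEAL'),
                            ('P-102', 'BEARING'), ('P-103', 'GASKET');
    INSERT INTO material_issue VALUES ('WO-1', 'P-100'), ('WO-2', 'P-101'),
                                      ('WO-3', 'P-102'), ('WO-4', 'P-100');
""")

# Distinct commodities of every part that has actually been issued to a work order.
rows = conn.execute("""
    SELECT DISTINCT i.commodity
    FROM material_issue AS m
    JOIN item AS i ON i.itemnum = m.itemnum
    WHERE i.commodity IS NOT NULL
    ORDER BY i.commodity
""").fetchall()

failed_components = [commodity for (commodity,) in rows]
print(failed_components)
```

In this sample data GASKET never appears in the issue history, so it stays out of the list.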

Although this failed component domain could be quite large, a type-ahead search (the internet search technology called TYPEAHEAD BUFFER) lets you find the component you want in two to three clicks. And since the maintenance technician did the component replacement (or repair), they obviously know what the component is. In some instances, the component they replaced may not be in the list, in which case there needs to be another field called MISSING COMPONENT. If this field is populated, the information is electronically routed to the reliability team for review. Upon their approval, the software would add the new value to the domain and also backfill the work order component field.
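
A type-ahead lookup over a component domain might look like the following sketch. The function name, the matching rule (substring match, with prefix matches listed first), and the sample component list are all assumptions.

```python
def typeahead(domain, typed, limit=10):
    """Return domain values containing the typed text, prefix matches first."""
    needle = typed.strip().upper()
    if not needle:
        return []
    hits = [value for value in domain if needle in value.upper()]
    # False sorts before True, so values that start with the text come first.
    hits.sort(key=lambda value: (not value.upper().startswith(needle), value))
    return hits[:limit]


components = ["BEARING", "BELT", "MOTOR BEARING", "SEAL", "VALVE"]
print(typeahead(components, "bea"))
```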

The component problem could be a simple, static list of fewer than 25 values. An example list is shown in figure 6. For example, the primary key might be CALIB, and its description would be MISS-CAL-STICK, OUT-OF-CAL, OUT-OF-SPEC, with the description being searchable. Usually these problem codes would be entered by the technician, but they may sometimes require a supervisor or reliability professional.


Figure 6: Component problems
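
The searchable description can be sketched as a small lookup. The CALIB entry is the one described above; the LEAK entry and the function name are hypothetical and only there to show that a search returns the matching key.

```python
problem_domain = {
    "CALIB": "MISS-CAL-STICK, OUT-OF-CAL, OUT-OF-SPEC",  # example from the text
    "LEAK": "LEAKING, SEEPING, DRIPPING",                # hypothetical entry
}


def search_problems(domain, text):
    """Return the problem codes whose key or description contains the text."""
    needle = text.strip().upper()
    return sorted(
        code for code, description in domain.items()
        if needle in code or needle in description
    )


print(search_problems(problem_domain, "out-of-spec"))
```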

Cause codes can be more challenging. And the list of cause codes could be quite extensive. Further, the question needs to be asked, “How far do you go?” The danger in not capturing any cause code is that the expensive bearing you just replaced could fail again in six to nine months because you never really eliminated the cause. These values could be entered by the technician but might require the maintenance supervisor.

Cause Code Hierarchy

This is one time where I do recommend the use of a hierarchy. The benefit of a hierarchy is that you start at a higher level and then work down into human factors. Some asset management professionals state that the majority of equipment failures are related to human factors. But this does NOT mean maintenance. Defects can be inserted into the asset anywhere in the life cycle by humans. Figure 7 shows all of the possibilities.

Figure 7: Defect chronology example

Using a Cause Code Hierarchy

As stated, the cause code may be the hardest piece to capture. But without this information, the person creating the PM job plan will be guessing as to how best to prevent the failure from happening again. Admittedly, sometimes the cause is not known. On that note, be sure the maintenance staff is cautioned about (accidentally) destroying evidence. In figure 8, there are three cause fields added to the work order failure reporting screen. Cause code 1 has a short list of values: NO-DEFECT FOUND, AGING, WEAR-AND-TEAR, POWER-SURGE, HOUSEKEEPING, ENVIRONMENTAL REASONS, FORCE-MAJEURE, and HUMAN-FACTOR. The technician would read through the list and choose HUMAN-FACTOR only if none of the other values apply. If HUMAN-FACTOR is chosen, then cause code 2 is required. Similarly, with cause code 2, if WORKMANSHIP is chosen, then cause code 3 is required. There could be additional fields to capture lack of skill, lack of standards, and lack of leadership.

Keep in mind that cause code 3 could be dissected a lot further in the case of a formal root cause failure analysis. The purpose of this type of failure coding is to get somewhat close by capturing an RCM-style failure mode and preventing recurrence.


Figure 8: Cause code fields
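
The conditional rules for the three cause fields can be sketched as a validation function. The cause code 1 values are the ones listed above; the function name, the argument names, and the sample cause code 3 value are hypothetical.

```python
CAUSE_1 = {
    "NO-DEFECT FOUND", "AGING", "WEAR-AND-TEAR", "POWER-SURGE", "HOUSEKEEPING",
    "ENVIRONMENTAL REASONS", "FORCE-MAJEURE", "HUMAN-FACTOR",
}


def missing_cause_fields(cause1, cause2=None, cause3=None):
    """Return the names of cause fields that are required but not valid yet."""
    missing = []
    if cause1 not in CAUSE_1:
        missing.append("cause1")
    if cause1 == "HUMAN-FACTOR" and not cause2:
        missing.append("cause2")  # HUMAN-FACTOR makes cause code 2 mandatory
    if cause2 == "WORKMANSHIP" and not cause3:
        missing.append("cause3")  # WORKMANSHIP makes cause code 3 mandatory
    return missing


print(missing_cause_fields("AGING"))
print(missing_cause_fields("HUMAN-FACTOR"))
print(missing_cause_fields("HUMAN-FACTOR", "WORKMANSHIP"))
```

An empty list means the work order carries enough cause information to be accepted under these rules.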

Automatic Comparison of Work Order Failure Mode to RCM Failure Mode

Because all of this data is now inside the CMMS, there are many new possibilities which help us continually improve strategies and failure modes. Since the RCM failure mode is stored as three pieces, and the work order failure mode is stored as three pieces, it is now possible to implement an automatic comparison. Examples are shown below:

  1. If the work order failure mode does not exist in the RCM application for that same asset, then route to reliability professionals for review.
  2. If the work order failure mode does exist, then also route to reliability professionals to “ask why this failure happened.” Perhaps the CMMS PM job plan is missing or incorrectly set up. Or, the maintenance technician failed to follow procedure. Or, the PM work order was not performed per schedule.
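
The two comparison rules above can be sketched as follows, assuming the RCM failure modes are available per asset as (component, problem, cause) tuples. The asset number, the sample values, and the wording of the routing messages are hypothetical.

```python
# Hypothetical RCM library: asset -> set of (component, problem, cause) tuples.
rcm_library = {
    "PUMP-101": {("BEARING", "SEIZED", "LACK-OF-LUBRICATION")},
}


def review_reason(asset, work_order_mode, library):
    """Both outcomes route to the reliability team; only the question differs."""
    if work_order_mode in library.get(asset, set()):
        return "KNOWN-MODE: ask why the existing strategy did not prevent it"
    return "NEW-MODE: review and consider adding it to the RCM analysis"


print(review_reason("PUMP-101", ("BEARING", "SEIZED", "LACK-OF-LUBRICATION"), rcm_library))
print(review_reason("PUMP-101", ("SEAL", "LEAKING", "AGING"), rcm_library))
```

The exact tuple match only works because both sides are stored as validated fields rather than free text.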

Chronic Failure Analysis

To me, this may be the #1 benefit of having proper failure coding. By designing a failure analytic (see figure 9), we can now extract bad actors. And by choosing the worst offender in the list, the reliability team can dynamically drill down on the failure mode. Some say the largest portion of operations & maintenance costs is due to these recurring failures. If so, shouldn’t we have a way to focus on them?

Figure 9: Bad actor report

Explaining the Sort Metric

You could sort the bad actors several ways. But choosing the asset with the most breakdowns is not necessarily the best way. There could be value in operational downtime, mean time between failures, and asset condition. But based on my experience, I believe the most powerful metric is the average annual maintenance cost divided by replacement cost, referred to as AA$ / RPL$. By this metric, any asset whose ratio exceeds 7% is an asset in trouble, meaning you are spending quite a bit on it compared to its replacement cost. This metric will evaluate thousands of assets and float the bad actors to the top, instructing the reliability team to focus here. Once an asset is chosen, the team might perform a more detailed root cause analysis. This is called managing by exception.
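
A minimal sketch of the AA$ / RPL$ sort, using the 7% threshold described above. The asset numbers, the dollar amounts, and the dictionary keys are made up for illustration.

```python
def bad_actors(assets, threshold=0.07):
    """Rank assets whose AA$ / RPL$ ratio exceeds the threshold, worst first."""
    scored = []
    for asset in assets:
        if asset["replacement_cost"] <= 0:
            continue  # ratio is undefined without a replacement cost
        ratio = asset["avg_annual_cost"] / asset["replacement_cost"]
        if ratio > threshold:
            scored.append((asset["asset"], round(ratio, 3)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


assets = [
    {"asset": "PUMP-101", "avg_annual_cost": 12_000, "replacement_cost": 80_000},
    {"asset": "FAN-220", "avg_annual_cost": 1_500, "replacement_cost": 50_000},
    {"asset": "COMP-310", "avg_annual_cost": 9_000, "replacement_cost": 100_000},
]
print(bad_actors(assets))
```

With these sample numbers the ratios are 15%, 3%, and 9%, so only PUMP-101 and COMP-310 float to the top.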

Drilling Down on Failure Mode

The software experts say we can pretty much do anything nowadays. Well, my challenge is this: develop a bad actor report for the reliability team to run, display on the overhead, allow them to choose a particular asset, and then see the failure mode in pie chart format. The first pie chart would be failed components. The team lead would click on the largest wedge, and then see the component problems, and so forth. Imagine the wealth of information in the hands of the reliability team to make data-based decisions.

Figure 10: Bad actor report with pie charts
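
The data behind each pie chart is just a count per value at the current drill-down level. The sketch below assumes work order failure modes are available as simple records; the field names and sample rows are hypothetical, and the charting itself is left out.

```python
from collections import Counter

work_orders = [
    {"asset": "PUMP-101", "component": "BEARING", "problem": "SEIZED"},
    {"asset": "PUMP-101", "component": "BEARING", "problem": "NOISY"},
    {"asset": "PUMP-101", "component": "BEARING", "problem": "SEIZED"},
    {"asset": "PUMP-101", "component": "SEAL", "problem": "LEAKING"},
    {"asset": "FAN-220", "component": "BELT", "problem": "BROKEN"},
]


def drill_down(rows, field, **filters):
    """Count the values of `field` among rows that match every filter."""
    matching = (
        row for row in rows
        if all(row.get(key) == value for key, value in filters.items())
    )
    return Counter(row[field] for row in matching).most_common()


# First pie: failed components for the chosen asset.
print(drill_down(work_orders, "component", asset="PUMP-101"))
# Click the largest wedge: component problems for that component.
print(drill_down(work_orders, "problem", asset="PUMP-101", component="BEARING"))
```

Each click simply adds one more filter and moves on to the next field.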

Make the CMMS Work for You

This is how the core team and reliability team can work together to leverage failure data in the CMMS to make better decisions. More importantly, this is how a best-in-class organization can optimize return on assets and improve profitability. All that is required is a vision for excellence.

Figure 11: Living program example


John Reeve
John Reeve is a Senior Consultant. With 20,000 followers on LinkedIn, he regularly shares knowledge on many topics in support of asset management. Being the 2nd consultant hired by the company that invented Maximo, he spent the first 10 years consulting in project scheduling and cost management, followed by 15 years on Maximo software. But it was the last part of his career (another 15 years) where he focused on advanced concepts resulting in a U.S. Patent for maintenance scheduling called the “order of fire.” John is also a CRL, CMM, and book author.
