OMDEC | Optimal Maintenance Decisions Inc.

IMEC 2005 – Speaker Notes and Slides

November 2, 2005, 14:00 – 14:30

University of Toronto

Condition-based Maintenance (CBM) demonstration

Presenter:

Murray Wiseman, VP

Optimal Maintenance Decisions (OMDEC) Inc.

www.omdec.com

Murray@omdec.com

Slide 1

This presentation describes an evolving CBM demonstration by Oceana Sensors, ABB, and OMDEC. In building this demo, we asked ourselves:

 

  • What do we really want to prove?
  • What should a CBM strategy target?
  • How should we measure its benefits?
  • How should we support and improve the strategy?

 

Bear with me, as I try to convey how the demo sheds considerable light on these essential questions.

Slide 2 

We state our objective simply. When faced with a set of operating conditions, a set of requirements for our asset, and a set of condition monitoring data, how, exactly, do we interpret what all this information is telling us? How do we decide whether:

  1. to intervene immediately, or
  2. to continue operation until the next observation?

How can we know whether our decision process leads to the best decision, the so-called optimal one? What does the word "optimal" really mean?

Slide 3 

Here is what we, in developing the demo, thought it means. Please tell me if you disagree. As our stated goal for the demo, and for CBM in general, we desire to declare a potential failure at the right time.  

 

That, of course, raises the question, “What is the right time?”

The answer we came up with is "At the right level of risk."

How, then, do we define risk?

 

Slide 4

Here is a textbook graph, variations of which you have all seen. The vertical axis quantifies the bottom-line cost (per unit of production) of owning an asset. On the horizontal axis, we indicate some policy (scaled to risk) for deciding when to maintain the asset. We may consider the horizontal axis as a spectrum of varying risk. What is risk? Most people see risk as the combination of the probability that something will go wrong and the severity of the consequences if it does.

 

Generally speaking, we can be conservative in our CBM data interpretation policy, panicking at the slightest increase in a monitored variable. Looking solely at cost, that policy, which gravitates towards the left end of the spectrum, is expensive.

On the other hand, we can run our CBM program liberally, throwing caution to the wind and setting our action limits high. That too will be costly. As a matter of fact, its cost will approach that of having no CBM program at all.

Some point on the risk spectrum, therefore, should correspond to lowest cost. If cost were our main consideration, we would set our CBM data interpretation policy at that point.
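
To make the shape of that curve concrete, here is a minimal sketch. It is not taken from the demo. It assumes, purely for illustration, that the policy is a simple preventive replacement age and that life follows a Weibull distribution; the Weibull parameters and the two costs are made-up values.

```python
import math

def weibull_reliability(t, beta, eta):
    """Probability that the item survives beyond working age t."""
    return math.exp(-((t / eta) ** beta))

def cost_per_unit_age(t_replace, beta, eta, c_preventive, c_failure, steps=2000):
    """Expected cost per unit of working age if we replace preventively at t_replace."""
    r_end = weibull_reliability(t_replace, beta, eta)
    # Expected cycle length = integral of R(t) from 0 to t_replace (trapezoidal rule).
    dt = t_replace / steps
    expected_life = 0.0
    for i in range(steps):
        a = weibull_reliability(i * dt, beta, eta)
        b = weibull_reliability((i + 1) * dt, beta, eta)
        expected_life += 0.5 * (a + b) * dt
    expected_cost = c_preventive * r_end + c_failure * (1.0 - r_end)
    return expected_cost / expected_life

# Hypothetical values, chosen only to show the U-shaped curve.
BETA, ETA = 3.0, 1000.0
C_PREVENTIVE, C_FAILURE = 1.0, 5.0

ages = [100 * k for k in range(1, 31)]  # candidate policies, conservative to liberal
costs = [cost_per_unit_age(a, BETA, ETA, C_PREVENTIVE, C_FAILURE) for a in ages]
best_age = ages[costs.index(min(costs))]
print("lowest-cost replacement age on this grid:", best_age)
```

Both ends of the grid cost more than the point in between: the conservative end replaces too often, and the liberal end approaches the cost of running to failure.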

[Animation: Availability] 

We know, however, that cost may not be the only consideration. Availability can be more important than cost in a production-constrained environment. And, in general, maximum availability does not coincide with minimum cost.

 

[Animation: Reliability] 

But cost and availability may not suffice to specify our policy for an asset in its operating context. We may need to specify a survival probability for an upcoming mission. A CBM interpretation policy that targets survival probability will not, in general, coincide with our maximum-availability or minimum-cost CBM policies. The $64 question, then, given our current operating context (our policy objectives), is “Where on the risk spectrum do we want to operate?”

[Animation: Data interpretation policy] 

And the $65 question is “How do we set our potential failure declaration rule to respect that policy?”
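
As an illustration of the survival probability objective, the sketch below computes the probability of completing a mission given the asset's current age, and finds the latest age at which that probability still meets a target. It again assumes a Weibull life distribution with made-up parameters, and the function names are hypothetical; the demo itself works from condition data rather than age alone.

```python
import math

def weibull_reliability(t, beta, eta):
    """Probability that the item survives beyond working age t."""
    return math.exp(-((t / eta) ** beta))

def mission_survival(current_age, mission_length, beta, eta):
    """P(survive the mission | already survived to current_age)."""
    return (weibull_reliability(current_age + mission_length, beta, eta)
            / weibull_reliability(current_age, beta, eta))

def latest_age_meeting_target(mission_length, target, beta, eta,
                              step=1.0, max_age=10_000.0):
    """Largest starting age on the grid whose mission survival meets the target.

    Assumes beta > 1 (wear-out), so mission survival falls as age increases.
    """
    latest = None
    age = 0.0
    while age <= max_age:
        if mission_survival(age, mission_length, beta, eta) < target:
            break
        latest = age
        age += step
    return latest

# Hypothetical values.
BETA, ETA = 3.0, 1000.0
limit = latest_age_meeting_target(mission_length=100.0, target=0.95,
                                  beta=BETA, eta=ETA)
print("start the mission no later than age:", limit)
```

A policy built on this rule intervenes at a different point from the lowest-cost policy, which is the reason the objective has to be stated before the declaration rule is set.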

Slide 5 

We usually need help to answer those two questions. Help can take the form of an agent. An agent applies our stated objectives to the current data set. It interprets (formulates decisions based on) the CBM data in accordance with our objectives. 

 

The agent could be a computer algorithm, or a human applying a set of rules based on experience or written down in a standard operating procedure. It monitors data from the CBM and other operating databases. It takes into account maintenance and process events that have occurred recently. With due consideration of all relevant factors, it issues the "right" decision.  

 

Now we are getting closer to defining what we wanted our demonstration to prove.

 

Slide 6

Here is the demo hardware. An obvious challenge in building a CBM demo is the design of the failure mechanism. On the one hand, we wanted the failure to behave realistically. On the other, we did not wish to injure anyone or cause excessive damage.

 

After considerable trial and error, we decided upon the failure mechanism in the photo on the far right. 

 

The tee component is held in position by friction applied by the perpendicular compression spring. The friction force resists the load exerted by the lateral extension spring.  

 

Failure in this case is the loss of the function: “to hold the tee in place against a spring force in the presence of vibrations induced by an unbalanced rotor”.  

 

At failure, the tee strikes the golf ball, sending it down the ramp, where it activates a switch indicating that functional failure has occurred. To initiate a test run, the operator inserts the tee into the spring-loaded holding device and places the golf ball into position. He powers the motor drive and hits “Begin life cycle” on the console.

 

An intelligent agent (not quite as intelligent as Agent Smith[1]) applies a proportional hazard model to the data in near real time and issues an “optimal” maintenance decision.

When the agent recommends intervention, a more intrusive (visual) inspection can confirm an impending (i.e. “potential”) failure. Consequently, the equipment may be stopped and PM performed[2], thus avoiding a functional failure – the goal of CBM.
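
The following is a minimal sketch of how an agent might apply a proportional hazard model at each observation. It assumes a Weibull baseline hazard multiplied by an exponential function of the monitored variables, which is a common form of the model. The parameter values, the single vibration covariate, and the fixed hazard limit are all hypothetical, and the rule shown is a simplification rather than the demo agent's actual logic.

```python
import math

def phm_hazard(age, covariates, beta, eta, gammas):
    """Weibull proportional hazard: baseline(age) * exp(sum of gamma_i * z_i)."""
    baseline = (beta / eta) * (age / eta) ** (beta - 1)
    return baseline * math.exp(sum(g * z for g, z in zip(gammas, covariates)))

def decide(age, covariates, beta, eta, gammas, hazard_limit):
    """Return one of the two decisions discussed on slide 2."""
    hazard = phm_hazard(age, covariates, beta, eta, gammas)
    if hazard >= hazard_limit:
        return "intervene immediately"
    return "continue until next observation"

# Hypothetical model: one covariate (a vibration level) with weight 0.8.
BETA, ETA, GAMMAS = 2.0, 1000.0, [0.8]
HAZARD_LIMIT = 0.005  # hypothetical; in practice set from the policy objective

low_vibration = decide(500.0, [0.5], BETA, ETA, GAMMAS, HAZARD_LIMIT)
high_vibration = decide(500.0, [2.5], BETA, ETA, GAMMAS, HAZARD_LIMIT)
print(low_vibration)
print(high_vibration)
```

At the same working age, the two vibration readings lead to different decisions, because the hazard combines age with the condition data.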

Slide 7 

At each CBM inspection, a decision is displayed (blue box) as an updated residual life estimate (“0” in this case) and an optimal recommendation (“replace immediately” in this case).

 

The white box continuously updates the KPIs that evaluate the effectiveness of the active CBM policy.  

 

The graph on the top right is the hazard rate graph, also known as the conditional probability of failure graph (the hazard is proportional to the conditional probability). It tells us the probability of failure in the next short interval, given that the asset has survived to the point in the life cycle at which we ask the question. The dark blue curve is the hazard rate for both potential and functional failures. The pink curve represents the potential failures alone. The closeness of these two curves tells us that the CBM policy applied by the agent is effective. That is, our CBM policy is detecting most potential failures in time to pre-empt functional failures.
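
For those who want to see the arithmetic behind such a graph, here is a minimal sketch of an empirical conditional probability of failure, computed per age interval. The life cycle records are invented, and the label "PF" for a potential failure is our assumption for this example ("FF" is the functional failure suffix described on slide 10).

```python
def conditional_ending_probability(records, bin_width, n_bins, kinds):
    """For each age interval: endings of the given kinds / units that entered the interval."""
    probabilities = []
    for k in range(n_bins):
        start, end = k * bin_width, (k + 1) * bin_width
        at_risk = sum(1 for age, _ in records if age >= start)
        ended = sum(1 for age, kind in records
                    if start <= age < end and kind in kinds)
        probabilities.append(ended / at_risk if at_risk else None)
    return probabilities

# Invented life cycles: (working age at ending, kind of ending).
records = [(120, "PF"), (340, "PF"), (410, "FF"), (450, "PF"), (520, "PF"),
           (560, "PF"), (610, "PF"), (640, "FF"), (690, "PF"), (700, "PF")]

all_endings = conditional_ending_probability(records, 200, 4, {"PF", "FF"})
potential_only = conditional_ending_probability(records, 200, 4, {"PF"})
print(all_endings)
print(potential_only)
```

Dividing each value by the interval width gives an approximate hazard rate. The closer the second list is to the first, the larger the share of life cycles that ended with a potential failure rather than a functional one.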

 

This is exactly what we want our CBM policy and program to do. But at what cost? We could be operating far to the left on the risk spectrum that you will recall from slide 4.

 

Shifting our attention back to the KPI report, we note the total cost of maintenance (both preventive and reactive) per unit of working age (or per unit of production). We also read the savings compared to a laissez-faire policy.

 

Additionally we note the savings compared to some original policy (prior to adopting the current model). These KPIs give us some idea of how optimal our CBM policy is. We'll pursue that theme in a moment on slide 8.
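
As a sketch of how such KPIs can be computed, the snippet below totals preventive and reactive costs over a set of completed life cycles, divides by the total working age, and expresses the result as a percentage saving against a reference policy. The life cycles, the two costs, the "PF" label, and the reference rate are all invented for illustration.

```python
def cost_per_unit_age(cycles, cost_preventive, cost_failure):
    """Total maintenance cost divided by total working age over all cycles."""
    total_cost = sum(cost_failure if kind == "FF" else cost_preventive
                     for _, kind in cycles)
    total_age = sum(age for age, _ in cycles)
    return total_cost / total_age

def savings_percent(policy_rate, reference_rate):
    """Saving of the active policy relative to a reference policy, in percent."""
    return 100.0 * (reference_rate - policy_rate) / reference_rate

# Invented life cycles: (working age at ending, kind of ending).
# "FF" ended in functional failure (reactive cost); "PF" ended preventively.
cycles = [(400, "PF"), (500, "PF"), (600, "FF"), (500, "PF")]
policy_rate = cost_per_unit_age(cycles, cost_preventive=1.0, cost_failure=5.0)

REFERENCE_RATE = 0.008  # hypothetical cost rate of the policy we compare against
saving = savings_percent(policy_rate, REFERENCE_RATE)
print(f"cost per unit of working age: {policy_rate:.4f}")
print(f"saving versus reference policy: {saving:.0f}%")
```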

 

But before we do, look at the bar chart on the bottom right. It tells us how good our residual life estimates have been. At every inspection the agent produces an estimated time to failure (functional or potential). If we plot the errors in those estimates for all inspections, we get the histogram shown. It tells us that most (412) estimates were less than 10% off. Approximately 5% of our estimates were as much as 80% off. The point is that we have demonstrated a method for evaluating the accuracy, or predictive ability, of our CBM program (also known as a “predictive” maintenance program). All CBM programs should monitor their own effectiveness!
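
A histogram of this kind takes only a few lines to build. The sketch below assumes, as one possible definition, that the error is the absolute difference between estimated and actual remaining life, expressed as a percentage of the actual remaining life. The sample pairs are invented and are not the demo's data.

```python
from collections import Counter

def error_histogram(estimates, bin_percent=10):
    """Count estimates per error band; keys are the lower edge of each band in percent.

    estimates: list of (estimated_remaining_life, actual_remaining_life),
    with actual_remaining_life greater than zero.
    """
    counts = Counter()
    for estimated, actual in estimates:
        error_percent = 100.0 * abs(estimated - actual) / actual
        band_start = int(error_percent // bin_percent) * bin_percent
        counts[band_start] += 1
    return dict(sorted(counts.items()))

# Invented (estimated, actual) pairs, one per inspection.
samples = [(95, 100), (108, 100), (60, 100), (180, 100), (100, 100)]
histogram = error_histogram(samples)
print(histogram)
```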

Slide 8 

How does the EXAKT predictive agent work? It applies a model (called a proportional hazard model) that was constructed from past patterns of data. Whenever a functional or potential failure occurred, it was recorded. Those events were correlated with monitored CBM data to provide a risk model. A transition probability analysis provides a predictive model of the monitored variables. Finally, economic considerations are blended into the model. A model is a yardstick that assists users in making decisions.
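
To give a feel for the transition probability step, here is a minimal sketch that estimates a transition matrix by counting. It assumes each monitored variable has already been grouped into numbered bands (0 = lowest) and that readings are equally spaced in time. The histories are invented, and this counting approach is only an illustration, not a description of EXAKT's internal method.

```python
def transition_matrix(histories, n_states):
    """Estimate P(next band | current band) from sequences of observed bands."""
    counts = [[0] * n_states for _ in range(n_states)]
    for history in histories:
        for current, following in zip(history, history[1:]):
            counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A band that was never left (no observed transitions) gets a row of zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Invented histories: the band of one monitored variable at successive inspections.
histories = [[0, 0, 1, 1, 2], [0, 1, 2], [0, 0, 0, 1]]
matrix = transition_matrix(histories, n_states=3)
for row in matrix:
    print(row)
```

Each row answers the question "given the band we are in now, how likely is each band at the next inspection?", which is what lets the model look ahead.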

 

In this case our “yardstick” is the green, yellow, and red graph of Slide 8. The position and shape of the boundaries between “good” and “bad” depend on failure probabilities, on the ratio of proactive to reactive cost or repair time, and on our organizational objectives for the asset.

 

In this case the objectives have been set to minimize cost. We see that the cost of applying the proposed model is 65 cents, which is a savings of 51% over some previous policy.  

 

However, nothing is free. Look at the mean-time-to-replacement column. It is only 1791 compared to 3944, so we replace roughly 2.2 times as often. We have achieved lower cost at the price of more frequent maintenance. That policy is optimal, but only with respect to the objective of cost.

The point is that the model allows us to set whatever compromise among cost, availability, and reliability is appropriate to our current operating context. We may, thereafter, operate (or have our agent operate) our CBM program accordingly.

Slide 9  

Where does an optimal decision model come from? It comes, usually, from the data of past experience. The CMMS historical database holds valuable data. This is particularly true when work orders have been documented by the technician using the principles of "living" RCM. 

 

Here is an example of a well-documented work order. Notice the inclusion of the 5 RCM knowledge elements:

 

  1. What function was lost, compromised or threatened?
  2. In what way - (including whether the loss of function was total, partial, or potential)?
  3. What was the cause (at a practical depth in the causality chain)?
  4. What happened - (the sequence of relevant events preceding, during, and following the failure)? and
  5. How did/could it matter (consequences to the user/owner/society)? 

 

These knowledge elements are key to good data for reliability analysis.  

 

 

Notice the field RCMREF. It has been filled with a number that refers to a record in the RCM knowledge base. This work order, then, is an instance of an already known item-function-failure-cause combination. Hence the values for the 5 knowledge elements may be referenced rather than duplicated on the work order form.

Slide 10 

Work orders, documented using the principles of living RCM, permit EXAKT’s work order processor (EWOP) to generate an Events table such as this one. Note that each event is an instance of an RCM record. Reliability engineers and Weibull experts among you will quickly recognize this table as the ideal fuel of reliability analysis (RA).

 

Note the Event code, e.g. E836FF. “E” denotes that this is an “ending” event. “836” refers to a record in the knowledge base for an item-function-failure-cause. And “FF” means that this was a “functional” failure. Hence the failure “code” contains the precise information required for RA. Contrast this method with the failure codes used in the drop-down lists of the typical CMMS work order. Data compiled from such lists provide little value for RA.
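
A code of this shape is easy to split by program. The sketch below infers the general layout (one letter, a number, then a letter suffix) from the single example E836FF, so treat that layout as an assumption. Only the meanings stated above are mapped; any other prefix or suffix is passed through unchanged rather than guessed.

```python
import re

# Layout inferred from the one example "E836FF": letter, digits, letters.
EVENT_CODE = re.compile(r"^([A-Z])(\d+)([A-Z]+)$")

# Only the meanings given in the notes; other values are left as-is.
KNOWN_POSITIONS = {"E": "ending"}
KNOWN_KINDS = {"FF": "functional failure"}

def parse_event_code(code):
    """Split an event code into its position, knowledge base reference, and kind."""
    match = EVENT_CODE.match(code)
    if match is None:
        raise ValueError(f"unrecognized event code: {code!r}")
    position, rcm_ref, kind = match.groups()
    return {
        "position": KNOWN_POSITIONS.get(position, position),
        "rcm_ref": int(rcm_ref),
        "kind": KNOWN_KINDS.get(kind, kind),
    }

parsed = parse_event_code("E836FF")
print(parsed)
```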

Slide 11 

Note that we are making a clear distinction between reliability analysis (RA) and reliability-centered maintenance analysis (RCMA). The former is the study of what did happen, while the latter is the study of what could happen. We are proposing, in essence, that, using the EWOP, the two processes cross-fertilize and populate a common knowledge base – an RCM knowledge base. RA and RCMA, using the methodology of the EWOP, continuously improve that knowledge base. The process augments and refines our comprehension of the failure behavior of each of our significant physical assets.

Slide 12

The EWOP generates the Events table. The Events table provides the data in a perfect form ready for any type of reliability analysis software or procedure – for example, Weibull, EXAKT, Jackknife, Pareto, and many others.

 

Further, the EWOP appends new knowledge to the RCM knowledge base. Where a technician has discovered an item-function-failure-cause that the original RCM analysis (FMEA) had not anticipated, he performs that FMEA directly within the work order, on the spot, while the relevant observations are immediately before him. The EWOP transfers that knowledge from the work order to the RCM table. A supervisor, a reliability engineer, or anyone versed in the RCM “language” and with adequate knowledge of the equipment audits the new records. He discusses each proposed new record with the originator, so that a consensus is reached on the appropriate content and level of detail.

 

 

Slide 13 

How does the growth and exploitation of the knowledge base support our global maintenance strategy? Most maintenance experts and consultants agree that KPIs assist the physical asset manager in knowing and achieving his targets.

 

Let's look at one KPI as an example, quality loss. Quality loss may be due to process or machine malfunction. We, in maintenance, are concerned, primarily, with the latter. 

[Animation: CMMS] 

We can drill down from the KPI using our now well-documented CMMS.

[Animation: Reliability analysis software] 

With our rich data source, we may call upon scores of reliability analysis software tools to study and draw conclusions from our database. 

[Animation: Improved maintenance policies] 

The outputs of those analyses and studies are, without doubt, improved maintenance policies. Improved policies result in improved KPIs. And so the cycle is closed.

 

Most maintenance information systems today lack adequate emphasis on the dashed box on the left - the CMMS historical database populated via living RCM. The EWOP approach enables and encourages the enrichment of that important intellectual asset. An EWOP-enabled CMMS will drive continuous reliability improvement.

Slide 14 

Here is another way of expressing the continuous improvement cycle. In physical asset management we strive to find a strategy that results in a desired performance.

[Animation: Maintenance Policy] 

Our strategy embodies our maintenance policy, which includes our mix of reactive and proactive tasks and the policies whereby we carry out those tasks (for example, a specific CBM data interpretation policy).

[Animation: KPIs and Reliability Analysis] 

We measure the results of our maintenance policies with KPIs and drill down with reliability analysis tools. Those KPIs should have been set to meet our organization’s performance requirements. Those requirements fall into two categories: (1) the balance sheet and (2) social responsibility.

[Animation: feedback loop 4] 

If we aren't meeting our performance requirements, we need to adjust our KPI targets.

[Animation: feedback loop 3] 

And if we're not meeting our KPI targets, we need to adjust our maintenance policy. 

 

Understanding the detailed nature of these relationships:

  1. How does policy impact KPIs?
  2. How do KPIs relate to corporate performance?
  3. How shall KPI results and analyses guide changes in policy? and
  4. How shall performance shortfalls guide changes in KPI targets?

is, I submit to you, the essence of effective physical asset management.

 

Thank you for your attention.

 



[1] The Matrix, 1999

[2] By pushing the tee back to its starting position, thus renewing the asset with respect to this failure mode.
