Optimal Maintenance Decisions Inc.
|
Asset Reliability
Lexicon
Age Exploration:
Any analysis procedure that examines historical data in order to improve the maintenance plan by increasing an item’s reliability, availability, maintainability, productivity, or by reducing cost. (Also called
reliability analysis.)
Applicable:
A task is technically feasible and practical. For a condition based maintenance task it means that a
potential failure can be detected and assessed well enough in advance of a functional failure to avoid or reduce its consequences. For a scheduled overhaul it means that the item has a
useful life. For a failure whose consequences are
hidden an applicable failure-finding task must detect in a practical way the failure of a safety or backup system. (To be
effective, the task must reduce the probability of a
multiple failure to the required level.)
Availability:
(total scheduled time – downtime)/total scheduled time. Or,
MTTF/(MTTF+
MTTR). Or, Uptime/(Uptime + Downtime)
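The equivalent forms of the availability formula can be checked with a short calculation. The sketch below is illustrative only: the function names and the figures (720 scheduled hours, 36 hours of downtime, four life cycles) are assumptions made up for the example.

```python
def availability_from_downtime(total_scheduled_time: float, downtime: float) -> float:
    """(total scheduled time - downtime) / total scheduled time."""
    if total_scheduled_time <= 0:
        raise ValueError("total_scheduled_time must be positive")
    return (total_scheduled_time - downtime) / total_scheduled_time


def availability_from_means(mttf: float, mttr: float) -> float:
    """MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


# Hypothetical figures: 720 scheduled hours, of which 36 hours were downtime.
a1 = availability_from_downtime(720.0, 36.0)

# The same period viewed as 4 life cycles: 684 h uptime / 4 and 36 h downtime / 4.
a2 = availability_from_means(mttf=171.0, mttr=9.0)

print(round(a1, 4), round(a2, 4))
```

Both forms give the same answer here because the mean values were derived from the same uptime and downtime totals.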
CBM (condition based maintenance)
A maintenance process that:
- acquires condition data,
- extracts features, and
- issues a decision whether to:
  - perform maintenance (possibly a more intrusive inspection) immediately, or
  - plan to perform maintenance (or a more intrusive inspection) within a specified future time period, or
  - continue operation until the next CBM observation.
Combined analysis
Often, condition monitoring data from different test methods conducted at different times and frequencies is available. All of this data is considered potentially relevant to the failure modes of a piece of equipment. That is, a single CBM decision model may include variables from different condition monitoring technologies (vibration, oil analysis, process control, visual observation, and so on). The method to deal with this situation is called "Combined Analysis". Combined analysis pre-processes condition monitoring data in order to harmonize the data chronologically regardless of the variety of timestamps associated with the readings. The software uses one of two methods to fill the data gaps created in these situations. Simple interpolation is the usual method. However, a more intensive numerical method is also available where required.
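The simple interpolation approach can be sketched as follows. This is a minimal illustration and not the software's actual procedure: the series names, the readings, and the choice to hold the first or last reading outside a series' time range are all assumptions.

```python
from bisect import bisect_left


def interpolate(times, values, t):
    """Linearly interpolate a reading at time t.

    Assumption: outside the observed range the nearest end value is held.
    """
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_left(times, t)
    if times[i] == t:
        return values[i]
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def harmonize(series):
    """Align several (times, values) series on the union of their timestamps."""
    all_times = sorted({t for times, _ in series.values() for t in times})
    rows = []
    for t in all_times:
        row = {"time": t}
        for name, (times, values) in series.items():
            row[name] = interpolate(times, values, t)
        rows.append(row)
    return rows


# Hypothetical readings taken at different working ages (hours).
series = {
    "vibration": ([0, 100, 200], [1.0, 2.0, 4.0]),
    "iron_ppm": ([50, 150], [10.0, 30.0]),
}
table = harmonize(series)
for row in table:
    print(row)
```

Each row of the resulting table carries a value for every variable at every timestamp, which is the form a single decision model needs.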
Common cause failure (CCF)
A failure cause which results in several failures. The failures may become evident at different times and locations. Some failures may be potential failures. Examples: A mistake in a maintenance testing procedure may result in leaks over a period of time at different locations. A common component in a redundant safety system fails.
A CCF may be a
random or
systematic event.
Complex item:
An
item subject to more than one reasonably likely failure mode.
Condition (monitoring) data:
Inspection/measurement data (temperature, vibration, wear, yield, visual observation, performance, etc) from which a
potential failure may be deduced. Also known as "condition monitoring" or "CM" data. We often need to apply
signal processing to condition data in order to convert it to one or more
"condition indicators". Condition indicators are also known as
"extracted features", "
covariates" or
significant variables.
There are two main types of condition data:
- Internal variables that reflect the state of the equipment's health, and
- External variables that record the stresses imposed on the equipment.
Conditional probability of failure
It is the probability of failure in an interval given that the item survives to the start of that interval. The interval must be small compared to the
average life of the item. In the limit, as the interval becomes very small, the conditional probability divided by the interval approaches the
failure rate, or the hazard rate. See the
EXAKT forum for an amusing story by John Moubray on confusing the conditional probability of failure and the failure rate.
The conditional probability of failure graph may be drawn to represent the age-reliability relationship at any level in the asset hierarchy. It is most usefully drawn at the
failure mode level. This will expose the failure behavior at a practical depth, where the organization can apply a policy that will mitigate the consequences of the failure.
Consequences
Why a failure matters. There are four (categories of) consequences:
- Hidden (operating personnel are unaware that a function has been lost)
- Safety, health, environmental
- Operational (impacts customer service, quality, delivery, and/or direct costs)
- Non-operational (impacts only the maintenance budget)
Control limit:
A measurement or calculated value that sets a preventive maintenance policy. It is the potential failure point. In EXAKT it is the optimal level of risk that will achieve some long run objective (for example: cost, survival, availability).
Covariate:
A condition indicator. A condition data variable or transformation of one or more variables to be tested in a proportional hazard model for
significance. Its precise relationship to a failure mode's probability is determined. That relationship is then incorporated into a decision making rule or
decision model.
Cube
An
OLAP cube is an arrangement of data in arrays to allow fast analysis. Cubes avoid a limitation of relational databases which are not well suited for near instantaneous analysis of large amounts of data.
OLAP cubes can be thought of as extensions to the two-dimensional array of a spreadsheet. For example a company might wish to analyse some financial data by product, by time-period, by city, by type of revenue and cost, and by comparing actual data with a budget. These additional methods of analysing the data are known as
dimensions.
A financial analyst might want to view the data in various ways, such as displaying all the cities down the page and all the products across a page. This could be for a specified period, version and type of expenditure. Having seen the data in this particular way the analyst might then immediately wish to view it in another way. The cube could be re-oriented so that the data displayed now has periods across the page and type of cost down the page. Because this re-orientation involves re-summarising very large amounts of data, this new view of the data has to be generated efficiently to avoid wasting the analyst's time, i.e. within seconds, rather than the hours a relational database and conventional report-writer might take.
There can be more than three dimensions in an OLAP "cube". The technical definition of an OLAP cube is that it is an abstract representation of a
projection of an RDBMS relation.
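The idea of re-orienting a cube along different dimensions can be illustrated with a small in-memory sketch. The fact rows, the dimension names, and the `pivot` helper are hypothetical. A real OLAP engine pre-aggregates and indexes the data, whereas this example simply re-sums the rows on each request.

```python
from collections import defaultdict

# Hypothetical fact rows: each carries dimension values and one measure (cost).
facts = [
    {"city": "Toronto", "product": "Pump", "period": "Q1", "cost": 120.0},
    {"city": "Toronto", "product": "Valve", "period": "Q1", "cost": 80.0},
    {"city": "Montreal", "product": "Pump", "period": "Q1", "cost": 100.0},
    {"city": "Montreal", "product": "Pump", "period": "Q2", "cost": 60.0},
]


def pivot(rows, down, across, measure):
    """Sum a measure over all rows, grouped by two chosen dimensions."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[down]][row[across]] += row[measure]
    return {key: dict(cells) for key, cells in table.items()}


# Cities down the page, products across the page.
by_city_product = pivot(facts, "city", "product", "cost")

# Re-oriented view: periods down the page, products across the page.
by_period_product = pivot(facts, "period", "product", "cost")

print(by_city_product)
print(by_period_product)
```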
Data
We analyze two main types of data: age data (aka "life data" or "event data") and condition data. Each contains a number of sub-categories:
- Age data. Events that mark the date, working age, and circumstances of:
  - The beginning of a life-cycle, and
  - The ending of a life-cycle, either by:
    - Failure, either:
      - Potential, or
      - Functional, or by
    - Suspension, and
- Condition data, of which there are two main types:
  - Measurements and inspection results, and
  - Process data, consisting of:
    - External variables, and
    - Internal variables
Data analysis
Often called "data cleaning" or "data cleansing". It is the investigation and explanation of anomalies and trends. Data analysis
precedes reliability analysis. It requires synchronized views of multiple data sources. The BI-Cycle report shown here unifies three data sources:
- real time monitored data (upper and lower line graphs),
- work orders (middle graph indicating PF, FF and S events), and
- reliability knowledge (RCM) records.
Hitting an event point on the middle graph drills through to the Events table, the work order, and the RCM knowledge record. Examining events and CBM data in a single view is the initial step to understanding failure behavior and its relationship to monitored data. Through such meticulous investigation with the help of dynamic data analysis tools, the reliability engineer ensures that subsequent reliability analysis will be based upon clean and well understood data.
Data Mart:
An open database system that effectively stores engineering data, maintenance work orders, preventive maintenance programs, production data, process data, spare parts, health, safety, and environmental incidents, and more. The LRCM BI-Cycle Data Mart can be implemented in the following Business Intelligence Solutions:
- BI Data Warehouses: Oracle Enterprise, Microsoft SQL server and SAP BW
It has standard extractors for the following source systems:
- ERP systems: SAP, IFS, Indus and others
- CMMS systems: Maximo, Datastream and others
Data reduction
A form of CBM signal processing for summarizing real time (or frequently acquired) data for use in condition based maintenance decisions. Data reduction is required when data acquisition intervals are much smaller than the practical observation interval for CBM decision making. The goal of data reduction is to extract, faithfully so as not to lose significant information, a summary of data over a practical period of time. The summary is then used as a CBM condition indicator or variable, for example in an EXAKT decision model.
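A minimal sketch of data reduction follows, assuming that the mean and maximum over each observation interval are adequate summaries. That choice, the function name, and the sample readings are assumptions for illustration; the appropriate summary statistic depends on the failure mode being monitored.

```python
from statistics import mean


def reduce_readings(readings, interval):
    """Summarize (time, value) readings into one record per interval."""
    buckets = {}
    for t, v in readings:
        start = (t // interval) * interval  # start of the interval containing t
        buckets.setdefault(start, []).append(v)
    return [
        {"start": start, "mean": mean(vals), "max": max(vals), "n": len(vals)}
        for start, vals in sorted(buckets.items())
    ]


# Hypothetical readings every 10 minutes, reduced to 60-minute observations.
readings = [(0, 1.0), (10, 3.0), (20, 2.0), (60, 4.0), (70, 6.0)]
summary = reduce_readings(readings, interval=60)
for record in summary:
    print(record)
```

Each summary record can then serve as one observation of a condition indicator at the practical decision interval.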
Decision Model
A method for interpreting
condition data, age data, and any other data thought to reflect the state of health of a physical asset. An optimized decision model is one which maximizes or minimizes some objective (e.g. availability or cost respectively). An optimal decision model may be developed that achieves some performance measure such as a specified mission reliability or a required preventive to corrective maintenance ratio.
Defense wide area network (DWAN)
Wide area network at the Canadian DND (Department of National Defense).
Diagnostics
We "diagnose" a system in order to locate the cause of a functional loss, aka a failure, that has already occurred. Starting with a symptom, such as an error message, or a more obvious loss of capability of the asset, the troubleshooter must locate the cause of failure in the shortest time and at least cost.
If a functional failure has not yet occurred, but some symptom or "condition indicator" signals that function loss is imminent, we consider the detection of the impending failure a "prognostic procedure". Advancing technology sometimes blurs the distinction (in the minds of the users of the technology) between diagnostics and prognostics because the same technology may serve both purposes. Nevertheless, it is worthwhile and best practice to take a moment to indicate whether the event was a PF or an FF when closing the work order.
Dimension reduction
A form of CBM signal processing for extracting the most useful CBM indicators from a large number of possible significant indicators that are made available, usually by real time on-board data acquisition systems. Principal component analysis is one of a number of dimension reduction techniques.
Effective:
A task is worth doing because it accomplishes the intended objective – to lessen satisfactorily or to avoid entirely the consequences of a failure.
Effects:
The Effects attribute in an RCM knowledge record answers the following questions:
• What sequence of events (component level to organization) could be touched off by the failure mode?
• How does the failure make itself known? What observable events lead up to the failure?
• How is safety or the environment impacted? (without mentioning the words "safety" or "environment")
• How is production impacted? (quality, cost, customer service)
• Is there any additional damage caused? Are there currently any mitigating circumstances or tasks?
• How long will it take and what actions must be accomplished to correct the failure?
• How does the likelihood of this failure depend on deeper causes? Has it happened before? How often? Under what circumstances? How likely or unlikely is this failure mode considered to be?
Equipment health monitoring (EHM)
Also called CBM, On-condition maintenance, Condition monitoring, CM, Predictive maintenance, PdM, prognostic health management, PHM. See
condition data.
ETL:
"Extract Transform Load" - a data warehouse methodology for accessing multi-source data. LRCM on the BI-Cycle platform enables us to "transform" all relevant information within a "data mart" according to rules based on "meta" data (i.e. descriptions of data). The transformations embodied in a data mart can be quite complex. They may include selective loading, translating coded values, encoding free form values, deriving new calculated values, joining together data from multiple sources, summarizing and grouping, pivoting, and splitting columns into multiple columns. Therefore we require good ETL software to make the transform step simple and systematic for an analyst or reliability engineer to design and execute. LRCM, based on the BI-Cycle engine, contains such a mature, flexible tool in daily use in over 75 utilities, process, and power generation companies. The LRCM system integrates fully with all major CMMSs and EAMs, SAP PM - Business Warehouse - Netweaver, and OSISoft PI Data Access Pak.
Event type
Usually one of
PF,
FF, or
S indicating the way in which a component or failure mode's life cycle ended. By including event types and RCMREFs as attributes of significant work orders we make
reliability analysis possible.
Events table
A table generated from the CMMS by the LRCM process operating on the BI-Cycle platform that presents work order data in a form that is easily handled by reliability analysis algorithms such as those of EXAKT.
EXAKT
A software system consisting of a module for creating optimized CBM decision models and a module for deploying those models in an intelligent agent. The agent module, called "EXAKT for Decisions", silently monitors condition monitoring data and returns optimal recommendations and remaining useful life estimates (RULE).
Failure:
Two types:
- Potential failure (PF) – an unambiguous indication that a functional failure is imminent (degraded failure resistance). A potential failure may be declared at the moment of detection by a CBM inspection. However, it requires physical confirmation (that failure was indeed imminent) prior to recording this event as a PF on the work order. A potential failure will eventually, if not repaired, deteriorate to a functional failure. By definition, a potential failure has relatively minor consequences compared to those of a functional failure, and
- Functional failure (FF) – the partial or total loss of one of an item’s required functions, thereby provoking consequences to the organization.
In an EXAKT model, either type of failure is mapped in
marginal analysis to an "
EF" event.
Failure analysis
The term "failure analysis" is used in root cause forensic and data analysis to discover the "latent root cause" of a significant failure that
has occurred. We use the term "failure analysis" differently. In RCM (and the living RCM process) it means the specification of the manner and extent to which a function
can cease to perform as required.
Many maintenance professionals who seek information with which to improve
OEE seize upon the opportunity to record failures in the form of “failure codes”. Such short descriptive acronyms or phrases appear as a “good first step” towards acquiring useful
knowledge about failure behavior. Ideally, failure codes should present themselves in the form of configurable, context sensitive drop-down lists (or check boxes) for convenient, yet accurate, failure classification. The maintainer selects a failure code while completing the work order form.
Failure codes, unfortunately, seldom realize their users' (maintenance assessment / analysis / improvement) expectations.
Pick lists of maintenance failure codes are often difficult to choose from and prone to error. The selection items are often too general or do not adequately fit a given situation. Or, alternatively, long lists of precise codes suffer from “choice overload” resulting in the overuse of the default “Other”. Without doubt, effective and accurate lists are the ultimate objective of reliability-centered knowledge systems. But deciding what selection choices to place on such pick lists is no trivial matter. Some intermediary process is required that will facilitate the day-to-day recording of useful reliability knowledge in the short term, but additionally, must eventually evolve to the provision of accurate, robust pick lists. LRCM addresses the problem of failure code development and takes an approach that is reasonable, simple, robust and progressive. That approach, elaborated in
Workshops 1 and 2, will
unify, in a continuous knowledge refinement process, the failure mode records in the RCM worksheet (knowledge base) with the
failure codes in the work order database.
Failure management policy
A failure management policy is a decision, flowing from a RCM analysis, that addresses, to the users' satisfaction, the consequences of a failure mode. Applying the
RCM decision algorithm will lead to the selection of one of the following failure management policies:
- Condition based maintenance (CBM),
- Time based maintenance (TBM),
- Two of the above,
- Failure finding (protective / backup devices),
- Run-to-failure,
- Redesign.
Failure mode
The event that causes a
failure. It is expressed in the RCM knowledge base at a practical depth in the causality chain. It is also known as the "failure mechanism" or "failure cause". Often, it is practical to consider the failure of a component as a single (i.e. dominant) failure mode. In those cases seeking additional granularity or depth of causality is not worthwhile.
Failure rate
The term
failure rate often means the inverse of the average life (mean-time-to-failure.) The failure rate
function is at various times called the hazard function, failure rate, hazard rate, conditional failure rate, instantaneous failure probability, instantaneous failure rate, local failure rate, or the
conditional probability of failure.
h(t) = f(t)/R(t)
where h(t) is the failure rate, f(t) is the probability density function, and R(t) is the survival function.

Often, the two terms "conditional probability of failure" and "hazard rate" are used interchangeably in many RCM and practical maintenance references. In those references the definition for both terms is:
"the conditional probability that an item will fail during an age interval given that the item enters (or survives to) that age interval."
This definition is not the one usually meant in reliability theoretical works when they refer to “hazard rate” or “hazard function”. Nowlan and Heap point out that the hazard rate may be considered as the limit of the ratio (R(t)-R(t+L))/(R(t)*L) as the age interval L tends to zero. (This equivalence is derived below.)
To summarize, "hazard rate" and "conditional probability of failure" are often used interchangeably (in more practical maintenance books). The “hazard rate” is commonly used in most reliability theory books. The conditional probability of failure is more popular with reliability practitioners and is used in RCM books such as those of N&H and Moubray. There are two versions of the definition for either "hazard rate" or "conditional probability of failure":
h(t) = f(t)/R(t)    (eqn. 1)
h(t) = (R(t)-R(t+L))/R(t)    (eqn. 2)
where L is the length of an age interval.
We can derive the failure rate (hazard) function (eqn. 1) from the conditional probability of failure (eqn. 2) by dividing eqn. 2 by L and letting L tend to 0, as follows:
Since F(t) = 1 - R(t), differentiating gives
f(t) = dF(t)/dt = -dR(t)/dt
Dividing both sides of eqn. 2 by L (so as to convert it to a rate), letting L tend to 0 (applying the limit definition of the derivative), and substituting the above equation for f(t):
lim (L->0) (R(t)-R(t+L))/(L*R(t)) = (1/R(t))*(-dR(t)/dt) = f(t)/R(t)
Note that, in the second version, t is not continuous as in the first version. For example, you may have t=0,100,200,300,... and L=100.
Actually, not only the hazard function, but also the pdf, cdf, reliability function, and cumulative hazard function have two versions of their definitions as above. The first version is defined over a continuous range of age t while the second one is defined over discrete age intervals, e.g., (0,100), (100,200), (200,300), ... Roughly, we can say the definition of eqn. 2 is a discrete version of the first definition (eqn. 1).
The definition of eqn. 1 is useful in reliability theory and is mainly used for theoretical development. The definition expressed by eqn. 2 is useful for reliability practitioners, since in practice people usually divide the age horizon into a number of equal age intervals. The pdf, cdf, reliability function, and hazard function may all be calculated using age intervals. The results are similar to histograms, rather than the continuous functions obtained using the first version of the definitions.
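The relationship between eqn. 1 and eqn. 2 can be checked numerically. The sketch below assumes, purely for illustration, a Weibull life distribution with hypothetical shape and scale parameters. It shows the conditional probability of failure (eqn. 2) divided by the interval length L approaching the hazard rate (eqn. 1) as L shrinks.

```python
import math

# Hypothetical Weibull parameters, chosen only for illustration.
BETA, ETA = 2.0, 1000.0  # shape, scale (hours)


def reliability(t):
    """Weibull survival function R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / ETA) ** BETA))


def hazard(t):
    """eqn. 1 for a Weibull life: f(t)/R(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    return (BETA / ETA) * (t / ETA) ** (BETA - 1)


def conditional_probability(t, L):
    """eqn. 2: (R(t) - R(t+L)) / R(t)."""
    return (reliability(t) - reliability(t + L)) / reliability(t)


t = 500.0
for L in (100.0, 10.0, 1.0, 0.01):
    # The middle column approaches the right-hand column as L tends to 0.
    print(L, conditional_probability(t, L) / L, hazard(t))
```

With a wide interval (L = 100) the two quantities differ noticeably, which is why the interval must be small compared to the average life before the two terms can be treated as interchangeable.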
An interesting anecdote on the confusion between failure rate and conditional probability of failure can be seen in the
forum.
Fault finding or failure finding
The detection of the loss of a
hidden function. This definition contrasts subtly with that of "
diagnostics" or "prognostics".
Functional analysis
Determining the functions of an Item under RCM analysis.
Functional modeling
Determining the
significant functions of an End Item in preparation for subsequent RCM analyses of the significant items.
Functional safety
Safety is freedom from unacceptable risk of physical injury or of damage to the health of people, either
directly, or
indirectly as a result of damage to property or to the environment.
Functional safety depends on a system or equipment operating correctly in response to its inputs. See safety instrumented function.
An over-temperature protection device, using a thermal sensor in the windings of an electric motor to de-energise the motor before the windings can overheat, is an instance of functional safety. However, providing specialised insulation to withstand high temperatures is not an instance of functional safety (although it is still an instance of safety and could protect against exactly the same hazard).
Hidden
Used as an adjective in
hidden failure,
hidden function, and
hidden consequences. All three expressions have the same meaning. They refer to a function whose failure has no
direct consequences. The consequences of a hidden failure are precisely that - "hidden". That is to say, operating personnel in the normal course of their duties will be unaware that a significant failure has occurred. The term "hidden" refers mostly to
protective or
backup devices including "
voting systems". Unless we dedicate a repetitive task or procedure, at an appropriate frequency, that will detect the failure of a protective device, a
multiple failure will eventually occur.
Safety instrumented systems are hidden functions.
History
A life cycle of an
item. A
sample comprises several histories. Sequential histories are given a "history number" in chronological order by EXAKT.
History transformations
History transformations provide the ability to use a variety of time processed variables in EXAKT models. The basic history transformation functions are:
- First(var) , Last(var) , Diff(var) , Rate(var, timevar) , RateD(var, timevar) , Cum(var) , CumRate(var, timevar) , NonDecr(var), SmoothAve(var, timevar, const) , SmoothLWAve(var, timevar, const) , SmoothQWAve(var, timevar, const), SmoothLin(var, timevar, const) , SmoothLWLin(var,timevar, const), SmoothQWLin(var,timevar,const), Smooth(var, const)
Some discussion of history transformations can be found on the
EXAKT forum.
Inspections:
Observations (physically (human senses) or electronically acquired) related to an item’s operation and maintenance from which a
potential failure may be detected or predicted and a remaining useful life (RUL) estimated.
CBM inspections tend to be less intrusive. Inspections vary in their degree of intrusiveness. Overhauls are the most intrusive form of inspections. Maintenance managers desire inspection programs that derive the most useful information on asset health at lowest cost.
IDEF
Integrated Definition Languages were developed by the U.S. Air Force in relation to IISS (Integrated Information Support System). The intent of the IISS efforts was to create 'generic subsystems' which could be used by a large number of collaborating enterprises, such as U.S. Defense contractors and the armed forces of friendly nations. RCM uses IDEFØ in functional modeling in order to illustrate the inter-relationship of assets within the End Item.
IDEFØ is a method designed to model the decisions, actions, and activities of an organization or system. The basic syntax for an IDEFØ model is shown in the figure. The method gradually exposes detail. The major functions are at the top and have successive layers of subfunctions. A "node chart" provides a quick index for locating details within the hierarchic structure of diagrams.
Item
A group of one or more parts or assemblies that is convenient to treat as a single entity for
reliability analysis. Items are defined at a high enough level of indenture so that their failures may be clearly related to failure consequences of the equipment as a whole and low enough so that the number of failure modes in the item is manageable. Should the number of failure modes become too large, a sub-component of the item can be "broken out" and analyzed as an
item in its own right.
Knowledge
Knowledge in maintenance is the ability to make good decisions and to verify that the decision process is optimal.
Knowledge base
Variants are "RCM knowledge base", "reliability knowledge base", "FMEA", "FMECA", "SAE JA1011", "HAZOPS", "OREDA ISO 14224", and many others. It is a structured compilation of knowledge about failures, failure modes, effects, and consequences. A
living reliability knowledge base is one that is referenced, incremented, and updated within the routine work order closing process. Work orders that reference a knowledge base become
instances of knowledge records (i.e. occurrences of failure modes described in knowledge records).
Reliability analysis counts those instances, in a variety of ways, in order to draw conclusions and create models of failure behavior to be used in optimizing maintenance decision strategies and tasks. Those “tasks” (
RCM questions 6 and 7) will be updated whenever reliability analysis suggests a failure behavior different from that conjectured in the initial RCM analysis, or when the operating context has changed to the extent that the consequences of failure have changed.
Living RCM (LRCM) is a process wherein work orders represent knowledge records and are occurrences (instances) of failure modes. This link between the work order system and the reliability knowledge base facilitates subsequent
reliability analysis of the
instances (i.e. occurrences) of failure modes. LRCM contrasts starkly with a traditional reliance by maintainers on
failure codes for this purpose. The living RCM process requires that a
significant work order, prior to closure, contain two specific information elements. They are:
- a reference to the relevant RCM knowledge record, and
- the life ending event type (usually one of PF, FF, or S).
An automated software procedure uses the combination of these two elements to generate an
Events table. This table, together with the Inspections table, constitutes the sample for reliability analysis processing.
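The generation of an Events table from significant work orders might be sketched as below. This is not the LRCM software's actual procedure. The field names (`rcm_ref`, `event_type`, `working_age`), the sample work orders, and the filtering rule are assumptions for illustration, and a real Events table would also record the beginnings of life cycles.

```python
VALID_EVENT_TYPES = {"PF", "FF", "S"}


def build_events_table(work_orders):
    """Keep work orders carrying both an RCM reference and a valid event type."""
    events = []
    for wo in work_orders:
        rcm_ref = wo.get("rcm_ref")
        event_type = wo.get("event_type")
        if rcm_ref and event_type in VALID_EVENT_TYPES:
            events.append({
                "rcm_ref": rcm_ref,
                "event_type": event_type,
                "date": wo["date"],
                "working_age": wo["working_age"],
            })
    # Group the instances of each failure mode together, in date order.
    return sorted(events, key=lambda e: (e["rcm_ref"], e["date"]))


# Hypothetical closed work orders; the second lacks the two required elements.
work_orders = [
    {"id": 1, "rcm_ref": "PUMP-01-A", "event_type": "FF", "date": "2020-03-01", "working_age": 4200},
    {"id": 2, "rcm_ref": None, "event_type": None, "date": "2020-04-11", "working_age": 4900},
    {"id": 3, "rcm_ref": "PUMP-01-A", "event_type": "PF", "date": "2021-01-15", "working_age": 3100},
]
events = build_events_table(work_orders)

# One of the KPIs listed for the pilot: PF/(PF+FF).
pf = sum(e["event_type"] == "PF" for e in events)
ff = sum(e["event_type"] == "FF" for e in events)
pf_ratio = pf / (pf + ff)
print(events, pf_ratio)
```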
In the living RCM procedure, the maintainer, planner, engineer, or supervisor, prior to closing the work order, looks up the RCM record (in the RCM knowledge base) that covers the current situation. If he finds no appropriate RCM record describing the failure mode or if the RCM record is incorrect or incomplete, he will propose an update to the knowledge base. Subject to a quality control procedure the knowledge base is duly updated by a designated verifier or RCM facilitator. Implementing a living RCM process in a maintenance department requires a careful and stepped approach. See also
LRP (Living RCM pilot).
LRP
Synonymous with "Living RCM Pilot". It is a pilot project with the objective to test and evaluate the novel procedures of
Living RCM. Pilot testing in the field permits the initiative to perform in a real-world setting, influenced by random factors and subject to conditions not included or even foreseen in the laboratory or design office. A pilot at an operational location permits the intended users to participate in the new process under their own terms and in a familiar setting. However, the pilot test environment should still be more controlled than actual operations. The following are among the elements of control:
- A comprehensive test plan structure should be followed.
- Equipment or classes of equipment to be analysed should be carefully selected.
- Test activity and results should be tracked and fully documented, including user comments.
- Input and output test data should be screened, with out-of-tolerance data clearly identified.
- The implementation team should be well trained and supported with hands-on oversight by the contractor, OEM, or software developer as appropriate.
- A specific pilot test timeframe and ending date should be established.
Complete records of the activity and results of the pilot test must be maintained to ensure technical capabilities work as intended, and that cause-and-effect actions result in desired outcomes. Initial LRP results will be the attainment of desired levels of intermediate KPIs:
- Number of new knowledge records added
- Number of corrections to existing knowledge records
- Number of references to the knowledge base on work orders
- Number of analyses performed
- Number of recommendations as a result of the analyses
- Number of recommendations implemented
- Number of instances of actual improved performance related to the living RCM program:
  - Availability
  - MTBF
  - Reliability trend
  - Cost of maintenance per unit of working age
  - PF/(PF+FF)
Documentation of pilot test results also helps assess whether the maintenance actions determined through reliability analysis are the most appropriate for the tested equipment or component.
A pilot project delivers a plan, cost, and risk analysis prior to roll out of the new procedures and technology.
Management system
A program or activity involving the application of management principles and analytical techniques to ensure the safe and reliable operation of process equipment.
Marginal Analysis
A method used in EXAKT for CBM decision modeling of
complex items. The software maps life events for individual failure modes (as specified in the
Events table) to the "B", "ES", and "EF" events used in the proportional hazard modeling calculation. Marginal analysis assumes that no mutually causal relationships exist among separate failure modes. (I.e. statistical independence of failure modes.) This
exercise describes the marginal analysis method.
Mean time to failure (MTTF):
The average life of an item. It is also known as the "expected" life of an item. It can be estimated by totaling the lives of an item or fleet over a period of time and dividing by the number of items. See the
EXAKT forum for the mathematical definition of MTTF.
Mean time between failure (MTBF):
The MTTF plus the MTTR.
MTTF and MTBF are often used to describe the
reliability of an item.
Mean time to return to service (MTTR):
The mean time to return to service. (Also called the maintainability.) Includes the diagnostic time, materials procurement time, manpower mobilization time, adjustment and set up time, and repair time. The
availability can be expressed in terms of MTTR and MTTF.
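As a small worked example, MTTF, MTTR, and MTBF can be estimated from recorded lives and restoration times. The figures are hypothetical, and the sketch assumes that every recorded life ended in failure (no suspensions).

```python
from statistics import mean

# Hypothetical records: operating hours to failure, and hours to return to service.
lives = [900.0, 1100.0, 1000.0, 1000.0]
restoration_times = [20.0, 30.0, 25.0, 25.0]

mttf = mean(lives)               # average life
mttr = mean(restoration_times)   # mean time to return to service
mtbf = mttf + mttr               # MTTF plus MTTR, as defined above

print(mttf, mttr, mtbf)
```

When some lives end in suspension rather than failure, a simple average like this underestimates the MTTF, and a fitted life distribution is needed instead.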
Minimum preventive maintenance time
This is an option in EXAKT which may be used when setting the decision parameters. Sometimes there is a short period of time at the beginning of a life cycle where the mechanical components are "bedding in" or "wearing in". During this period monitored variables, such as wear metals may be abnormally high. This would cause the hazard as calculated by the PHM to be high. Yet the model should not return a potential failure alarm during this transitory period. By setting this parameter, we avoid false alarms during the time of bedding in.
Multiple Failure:
A failure of a protected function at a time when its protective device (or backup system) is already in a failed state.
Non-rejuvenating event
A maintenance event that does not zero-time or roll back the accumulated working age of the asset, component, or failure mode under analysis. Instrument calibrations, oil changes, adjustments, alignments, etc. may be thought of as non-rejuvenating. They may, however, "artificially" reset condition monitoring variables. Where a CBM program interprets data impacted by such events, they should be recorded in the CMMS. Those events will then be accounted for correctly in an EXAKT model.
Null hypothesis
The null hypothesis proposes something initially presumed true. It is rejected only when it becomes evidently false, that is, when the researcher has a certain degree of confidence, usually 95% to 99%, that the data do not support the null hypothesis.
OEE (Overall equipment effectiveness):
OEE = Availability x Productivity x Quality. It tracks maintenance effectiveness, where:
- Availability = (scheduled time - downtime due to all forms of maintenance)/(scheduled time)
- Productivity = Product rate setting/Desired product rate
- Quality = (Product - Scrap)/Product
Additionally, tracking reliability (MTTF) will provide further insight and benchmarks for maintenance effectiveness. Software must permit drilling through from the KPI to the knowledge record and its instances (i.e. the work orders) if an improvement strategy is to be obtained from the analysis.
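The OEE arithmetic can be sketched in a few lines. The function name, argument names, and input figures below are hypothetical and serve only to show how the three factors multiply.

```python
def oee(scheduled_h, downtime_h, rate_setting, desired_rate, product, scrap):
    """Return (OEE, availability, productivity, quality) as fractions."""
    availability = (scheduled_h - downtime_h) / scheduled_h
    productivity = rate_setting / desired_rate
    quality = (product - scrap) / product
    return availability * productivity * quality, availability, productivity, quality


# Hypothetical month: 720 scheduled hours, 72 hours of maintenance downtime,
# a rate setting of 95 units/h against a desired 100 units/h,
# and 200 scrapped units out of 10,000 produced.
value, a, p, q = oee(720.0, 72.0, 95.0, 100.0, 10000.0, 200.0)
print(round(a, 2), round(p, 2), round(q, 2), round(value, 4))
```

Because the factors multiply, a modest shortfall in each one (90%, 95%, 98% here) compounds into an OEE of about 84%.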
OLAP:
Acronym for Online Analytical Processing. It is an approach to providing quick answers to analytical queries that are multidimensional in nature. OLAP is part of the broader category of business intelligence and data warehouse information technology, which also includes extract, transform, load (ETL), relational reporting, and data mining. The typical applications of OLAP are in business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting, and similar areas. The BI-Cycle system provides tools that are specialized for maintenance information and knowledge based analysis. The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).
Databases configured for OLAP employ a multidimensional data model, allowing for complex analytical and ad-hoc queries with a rapid execution time. They borrow aspects of navigational databases and hierarchical databases that are speedier than their relational kin.
Nigel Pendse has suggested that an alternative and perhaps more descriptive term to describe the concept of OLAP is Fast Analysis of Shared Multidimensional Information (FASMI).
The output of an OLAP query is typically displayed in a matrix (or pivot) format. The dimensions form the row and column of the matrix; the measures, the values.
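To make the matrix (pivot) idea concrete, here is a minimal sketch in plain Python that sums one measure over two dimensions. The work order records, field names, and the `pivot` helper are invented for illustration and do not represent the BI-Cycle system or any OLAP product.

```python
from collections import defaultdict


def pivot(records, row_key, col_key, measure):
    """Sum `measure` for every (row dimension, column dimension) pair."""
    table = defaultdict(lambda: defaultdict(float))
    for rec in records:
        table[rec[row_key]][rec[col_key]] += rec[measure]
    # Convert to plain dicts for easy printing and comparison.
    return {row: dict(cols) for row, cols in table.items()}


# Hypothetical work orders: dimensions are asset and failure mode,
# the measure is downtime in hours.
work_orders = [
    {"asset": "Pump A", "failure_mode": "seal leak", "downtime_h": 4.0},
    {"asset": "Pump A", "failure_mode": "bearing wear", "downtime_h": 6.5},
    {"asset": "Pump B", "failure_mode": "seal leak", "downtime_h": 3.0},
    {"asset": "Pump A", "failure_mode": "seal leak", "downtime_h": 2.0},
]
result = pivot(work_orders, "asset", "failure_mode", "downtime_h")
for asset, cols in result.items():
    print(asset, cols)
```

Here the assets form the rows, the failure modes form the columns, and total downtime is the value in each cell.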
On-condition maintenance:
The detection of a potential failure. Also known as condition based maintenance (CBM) and predictive maintenance (PdM).
Optimal:
An adjective describing a process, configuration, behavior, or methodology that will "best" achieve a desired objective or state. We would say that an optimal maintenance plan achieves lowest cost, highest availability, or some desired reliability or performance index, or a best compromise among several objectives. The expressions "optimum cost" and "optimum reliability" are incorrect uses in the maintenance context. Rather say, for example, "...an optimum decision model that leads to lowest overall unit cost". See the EXAKT Cost Model for more discussion on optimization.
PHM (proportional hazard model)
A method for relating an item's survival probability to its working age and to additional information in the form of significant variables (Cox and Oakes, 1984). These time-dependent variables may be extracted from condition data. The model may be represented as $h(t) = h_0(t)\exp(X(t)\gamma)$, where $h(t)$ is the hazard rate function, $h_0(t)$ is a baseline hazard function (e.g. Weibull), and $X(t)\gamma = \sum_i \gamma_i X_i(t)$ combines the vector of significant variables $X(t)$ (called covariates) with their respective parameters $\gamma_i$.
(Not to be confused with PHM in the sense of prognostic health management.)
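The following sketch evaluates the PHM hazard for one observation, assuming a Weibull baseline of the form h0(t) = (β/η)(t/η)^(β-1). The function names, covariates, and parameter values are hypothetical; this is an illustration of the formula only, not the EXAKT implementation.

```python
import math


def weibull_baseline_hazard(t, beta, eta):
    """Baseline hazard h0(t) for a Weibull model (assumes t > 0)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


def phm_hazard(t, covariates, gammas, beta, eta):
    """h(t) = h0(t) * exp(sum_i gamma_i * X_i(t))."""
    linear_term = sum(g * x for g, x in zip(gammas, covariates))
    return weibull_baseline_hazard(t, beta, eta) * math.exp(linear_term)


# Hypothetical item at a working age of 1000 h with two covariates,
# e.g. iron (ppm) and a vibration feature.
h_base = weibull_baseline_hazard(1000.0, beta=2.0, eta=2000.0)
h_now = phm_hazard(1000.0, [10.0, 2.0], [0.05, 0.1], beta=2.0, eta=2000.0)
print(h_base, h_now, h_now / h_base)
```

The ratio of the two hazards is exp(0.7), roughly 2, which shows how the covariate readings scale the age-based baseline up or down.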
PM:
Preventive Maintenance. Scheduled tasks that include: failure finding, on-condition (aka CBM, predictive maintenance), rework, and discard tasks.
Potential failure:
See Failure.
Prime Mover:
Australian for tractor, as in "tractor-trailer".
Prognostics
A maintenance action designed to detect a potential failure in order to mitigate the consequences of an eventual functional failure. See also Diagnostics.
Random event
Also called "random failure". The event is not influenced by time. External stresses having no time related pattern cause random events. A randomly failing device will fail with a 63% probability prior to reaching its mean time to failure (MTTF).
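The 63% figure follows from a constant hazard rate, that is, exponentially distributed lives: the probability of failing before age t is 1 - exp(-t/MTTF), which at t = MTTF equals 1 - exp(-1). A minimal check, with a hypothetical function name and MTTF value:

```python
import math


def prob_failure_before(t, mttf):
    """P(life <= t) for an exponentially distributed life with the given MTTF."""
    return 1.0 - math.exp(-t / mttf)


# Evaluate at t = MTTF; the MTTF value itself (5000 h) is arbitrary.
print(round(prob_failure_before(5000.0, 5000.0), 3))  # 0.632
```

The result does not depend on the MTTF chosen, which is why the 63% rule holds for any randomly failing item.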
Random variable
A random variable, for example, the height of a Canadian, has no fixed value. The average height of a Canadian, on the other hand, would be a deterministic variable. Random variables are completely described by their probability distributions. Partial information about a random variable may be imparted by one or more statistical descriptors such as the mean of the probability distribution and its standard deviation.
RCMREF
An index number primary key to the reliability knowledge base. By linking a work order or standard operating procedure to an RCMREF, one ensures that the latest knowledge regarding asset failure behavior is referenced. This causes each work order to become an instance of a knowledge record, thereby enabling reliability analysis. Furthermore, by revisiting the knowledge base in everyday work order practice, we conveniently grow and refine it.
Regular maintenance interval
This is an option in EXAKT that is used when setting the decision parameters. This optional parameter of the CBM Model will, if applicable, improve the calculation of the optimal policy. The Regular Maintenance Interval refers to non-rejuvenating events that are performed at regular intervals and are known to impact the covariate values. Such events may include minor adjustments, calibrations, or oil changes carried out at some interval of the working age, for example, oil changes performed every 600 hours.
When this option is set, EXAKT will automatically initialize one or more covariate values to the fixed values specified by the user. As an example, at regular oil change intervals, the numbers of metal particles for wear metals are set back to (say approximately) zero.
Reliability:
Usually defined as an item’s MTTF. Alternative definition: The survival probability of the item for a given mission duration. We (RCM and living RCM practitioners) often use the term "reliability" in its broadest sense, "the attainment of a desired availability, quality, and production rate at lowest cost, safely and without exceeding regulatory environmental emission limits".
Reliability analysis:
Synonym for “Age Exploration”: Any analysis procedure that examines historical data in order to improve the maintenance plan by increasing an item’s reliability, availability, maintainability, productivity, or by reducing cost, safely. The analysis, facilitated by an Events table, will often make use of software algorithms and techniques to determine an item's failure behavior as it relates to its working age and other internal or external variables. See age exploration.
Reliability analysis, in its simplest yet effective form, is the visual examination and counting of the instances (work orders) of a given RCM record. The two synchronized tables (knowledge base and related work orders) are shown in the figure on the right. When a maintenance department makes this type of tool available, universally, to technicians, planners, engineers, and operators, all are motivated to increment and reference the very knowledge base that assists their decision process. Such collaboration is doubly encouraged because individual donors to this vital intellectual asset are recognized directly within the knowledge records to which they have contributed.
Reliability block diagram (RBD)
A reliability block diagram is a graphical representation of a system’s components arranged in a manner to convey the ways in which system-level failures occur. Included within this representation are the logical relationships between the components and the system itself. Reliability block diagrams are denoted as “RBDs” and usually contain components arranged in a combination of series and parallel sub-structures.

The reliability block diagram is used in a Monte Carlo simulation software application. The user inputs information about each block, principally the age-reliability distribution type (exponential, lognormal, Weibull, etc.) to be used, the parameters for that relationship, and the distribution for the time to return to service. The simulation calculates numerous features including system reliability and availability.
Reliability-centered:
Adjective indicating the aim of sustaining and improving OEE and reliability, safely and without discharging materials into the environment in excess of current regulatory limits.
Reliability-centered maintenance (RCM):
A 7-question process used to determine what must be done in order to preserve the functions of an asset in its operating context at a level exceeding that required by its users. The seven questions are:
- What is the item supposed to do?
- In what precise way can it fail to perform a required function?
- What is the event that causes that failure?
- What happens surrounding the failure?
- Why does it matter (i.e. the consequences)?
- What regular maintenance task will mitigate the consequences to an acceptable degree?
- What needs to be done if no form of routine maintenance can be found that is adequate?
The tangible output of RCM is a reliability knowledge base.
Living RCM refers to a process whereby the knowledge base is accessed and updated within the normal work order process. Implementation of the RCM analysis requires that the decided tasks, procedures, material specs, frequencies, and executor designations be transferred to the CMMS. Some RCM software (e.g. BI-Cycle decision tool) is capable of automating this step.
Safety instrumented system (SIS)
A SIS performs specified functions (SIFs) to achieve or maintain a safe state of the process when unacceptable process conditions are detected. SISs are composed of sensors, logic solvers, final elements and support systems. Safety instrumented systems are separate and independent from control systems.
Safety instrumented function (SIF)
A safety function allocated to the safety instrumented system with a safety integrity level (SIL) necessary to achieve the desired risk reduction for an identified hazardous event.
Safety integrity level (SIL)
IEC 61508 specifies 4 levels of safety performance for a safety function, called "safety integrity levels". Safety integrity level 1 (SIL1) is the lowest level of safety integrity and safety integrity level 4 (SIL4) is the highest level. The standard details the requirements necessary to achieve each safety integrity level. Example of a safety function:
When the hinged cover is lifted by 5 mm or more, the motor shall be de-energised and the brake activated so that the blade is stopped within 1 second. The safety integrity level of this safety function shall be SIL2.
Sample:
Observations of an item’s (or group of similar items’) installations, failures, preventive renewals, significant events, and condition data over a period of time. Each "point" in the sample is a life-cycle (or history). A sample, physically, is a set of data.
SCADA
Supervisory Control and Data Acquisition
Signal Processing
Signal processing in CBM is also known as feature extraction. We apply a mathematical algorithm to the raw condition data in order to extract features that track the condition of targeted failure modes. For example, when the sidebands around the gear mesh frequency increase, then suddenly decrease and widen, this may indicate excessive gear tooth wear. Or, when the fault growth parameter increases, this may indicate the failure mode "tooth root crack". There are as many signal processing techniques and algorithms as there are differing physical situations. See Dr. Daming Lin's extensive Survey of signal processing and decision technologies for CBM.
Significance level
This is a statistical concept from hypothesis testing. It is defined as the probability that a given decision process will reject the null hypothesis when the null hypothesis is actually true (also known as "Type I error"). The decision is often made using the p-value: if the p-value is less than the significance level, then the null hypothesis is rejected. The smaller the p-value, the more significant the result is said to be.
Such results are informally referred to as "statistically significant". Popular levels of significance are 5%, 1% and 0.1%. For example, if someone argues that "there's only one chance in a thousand this could have happened by coincidence", a 0.1% level of statistical significance is being implied. The lower the significance level, the stronger the evidence.
A small significance level has both advantages and disadvantages. A smaller significance level gives greater confidence in the determination of significance, but runs a greater risk of failing to reject a false null hypothesis (also known as "Type II error"), and so has less statistical power. The selection of a significance level inevitably involves a compromise between significance and power, and consequently between the Type I error and the Type II error.
When EXAKT indicates that a proportional hazard model is "Not rejected at the 5% significance level", it means that the test did not find enough evidence to reject the model, using a decision rule that would wrongly reject a model that really does represent the data 5% of the time.
Significant:
OMDEC uses the word "significant" differently in several contexts:
1. with respect to variables (also called "covariates"):
A condition monitoring or process variable is significant if it has been found in the EXAKT proportional hazard modeling (PHM) analysis to correlate with the incidences of failure of an item. Covariate values $Z_i$ multiplied by their respective parameters $\gamma_i$ are exponents in the PHM equation (model). Among the statistics used to determine significance is the "p-value".
For every covariate parameter $\gamma_i$ included in the model, the hypothesis that $\gamma_i = 0$ is tested, i.e. that this covariate is not significant in the model. If the p-value is small (< 5%-10%), the hypothesis that $\gamma_i = 0$ can be rejected, i.e. we can assume that this covariate is significant and should be included in the model. If the p-value is > 5%, but not too large (say 10%-15%), different models with or without this covariate can be examined.
Significant:
2. with respect to functions, failures, failure modes, and their associated work orders:
An item, function, functional failure or potential failure, its underlying cause (failure mode), and its associated work order are said to be significant if the consequences of the functional failure matter. That is, they are either:
- hidden (i.e. undiscovered until a protected function fails. Applies to safety devices and redundant systems), or
- impacting safety, health, or the environment, or
- operational (impacting production quality, customer service, delivery), or
- non-operational but economically important
Significant event:
Operational or maintenance events that impact an item’s failure resistance or its condition data (non-rejuvenating events).
Simple Weibull Model
EXAKT uses the term "Simple Weibull Model" to designate the special case where no variables (other than working age) are known to influence failure probability.

The parameters η (scale parameter) and β (shape parameter) are age-related parameters.
β indicates the influence of an item's working age on the hazard function. 0 < β < 1 means improvement with age, β > 1 means deterioration with age, and β = 1 means that failures are completely random (accidental) and independent of age. In that latter case the hazard h(t) = 1/η, i.e. the hazard is constant.
η indicates the scale magnitude on the time axis, and it is also related to the average time to failure (MTTF). Theoretically, if β = 1 then MTTF = η, and if β > 1 then MTTF is between 89% and 100% of η. If β < 1 then MTTF > η.
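These statements about MTTF and η can be checked numerically. For a Weibull life distribution with hazard h(t) = (β/η)(t/η)^(β-1), the mean life is MTTF = η·Γ(1 + 1/β), so the ratio MTTF/η depends on β alone. The helper name below is hypothetical.

```python
import math


def weibull_mttf_ratio(beta):
    """Return MTTF / eta for a Weibull life distribution: Gamma(1 + 1/beta)."""
    return math.gamma(1.0 + 1.0 / beta)


# Tabulate the ratio for a range of shape parameters.
for beta in (0.5, 1.0, 1.5, 2.0, 3.5, 10.0):
    print(beta, round(weibull_mttf_ratio(beta), 3))
```

The ratio equals 1 at β = 1, exceeds 1 for β < 1, and for β > 1 dips to a minimum of about 0.886 (near β ≈ 2.2) before climbing back toward 1 as β grows, which is the "between 89% and 100% of η" range quoted above.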
Simulation
Refers to Monte Carlo simulation. It is a method in reliability engineering to predict what would happen in the future if a maintenance policy, external condition, failure behavior, or production variable were altered. The capacity to perform “what if” analysis on the future impact of such changes assists the physical asset, financial, and operational managers. They may ask questions of the type, “What will the downtime/availability/reliability/cost be of my system if I double/triple/halve the overhaul frequency?” They can test decision scenarios such as these by building and running a calculation engine on a model known as a reliability block diagram. Reliability Block Diagrams (RBDs) are ideal for modeling complex, large scale systems. Diagram types include simple series, parallel operating, standby redundant, bridge networks, and any type of random network configuration of the above. They model redundant configurations and other real-world scenarios to provide a wide range of calculations: availability, reliability, unavailability, MTBF, failure rate, expected number of failures, mean unavailability, total downtime, failure frequency, and hazard rate.
This computationally intensive technique involves repeated trials (hundreds, thousands, or hundreds of thousands). Each trial is randomly generated from the assumed failure probability distributions of all components. The results of each trial are tabulated, and the performance (cost, reliability, availability, etc.) is projected over a future time horizon of, say, 6 months or a year. Then certain operating parameters can be changed and the simulation run again, in order to eventually arrive at a selected operational and maintenance strategy.
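A minimal sketch of the repeated-trials idea follows. It assumes a tiny, hypothetical reliability block diagram (component A in series with a parallel pair B and C), exponentially distributed lives, and no repair, and it compares the simulated mission reliability with the closed-form answer. Commercial RBD simulators handle far richer cases (repair times, other distributions, costs); all names and figures here are illustrative.

```python
import math
import random


def simulate_reliability(mission_h, mttf_a, mttf_b, mttf_c, trials=20000, seed=1):
    """Fraction of trials in which the system survives the mission."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        # Draw one random life per component (expovariate takes the rate 1/MTTF).
        life_a = rng.expovariate(1.0 / mttf_a)
        life_b = rng.expovariate(1.0 / mttf_b)
        life_c = rng.expovariate(1.0 / mttf_c)
        # Series: the first failure ends the system life.
        # Parallel pair: the system lasts until the later of B and C fails.
        system_life = min(life_a, max(life_b, life_c))
        if system_life > mission_h:
            survived += 1
    return survived / trials


def analytic_reliability(mission_h, mttf_a, mttf_b, mttf_c):
    """Closed-form reliability of the same block diagram."""
    r_a, r_b, r_c = (math.exp(-mission_h / m) for m in (mttf_a, mttf_b, mttf_c))
    return r_a * (1.0 - (1.0 - r_b) * (1.0 - r_c))


estimate = simulate_reliability(1000.0, 5000.0, 2000.0, 2000.0)
exact = analytic_reliability(1000.0, 5000.0, 2000.0, 2000.0)
print(round(estimate, 3), round(exact, 3))
```

To run a "what if" scenario, change an input (for example, the MTTF of one block) and rerun the simulation; for systems too complex for a closed-form answer, the simulated estimate is the only practical result.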
Suspension:
Refers to replacement (discard) or rework of an item for any reason other than its failure (potential or functional). A life cycle that has been suspended becomes a "censored" (right-censored) data point in the sample. In EXAKT the symbols "EF" and "ES" are used to indicate "ending by failure" and "ending by suspension" respectively.
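To see why suspensions must be kept in the sample rather than discarded, consider the simplest case of a constant hazard (exponential lives), where the standard estimate of MTTF is the total accumulated working age of all histories divided by the number of failures. The sketch below assumes that simple model; it is not the method EXAKT uses, and the sample data are invented.

```python
def exponential_mttf(histories):
    """Estimate MTTF from (working_age, ending) pairs, ending being "EF" or "ES".

    Assumes a constant hazard rate: total age on test / number of failures.
    """
    total_age = sum(age for age, _ in histories)
    failures = sum(1 for _, ending in histories if ending == "EF")
    if failures == 0:
        raise ValueError("no failures observed; MTTF cannot be estimated this way")
    return total_age / failures


# Hypothetical sample: two lives ended by failure, two by suspension.
sample = [(1200.0, "EF"), (800.0, "ES"), (1500.0, "EF"), (500.0, "ES")]
with_suspensions = exponential_mttf(sample)
failures_only = exponential_mttf([h for h in sample if h[1] == "EF"])
print(with_suspensions, failures_only)
```

Ignoring the suspended histories gives 1350 hours instead of 2000 hours, because the survival time accumulated by the suspended items is thrown away.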
Useful Life:
The age at which the conditional probability of failure begins to increase and to which most items of the same kind survive. The conditional probability of failure (or hazard) curve resembles failure pattern B in Nowlan and Heap's "Reliability-centered Maintenance" report of Dec. 31, 1978 to the U.S. Department of Commerce.
Voting system
Expressed as MooN (M out of N) architecture. “N” designates the total number of devices that are implemented in parallel; “M” designates the minimum number of devices out of N that are required to initiate shutdown conditions or to achieve a defined output action.
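As an illustration of how the choice of M and N affects the chance that the system acts on demand, the sketch below computes the probability that at least M of N devices respond, assuming the devices are independent and each responds with the same probability p. Those assumptions, the function name, and the value of p are illustrative only; real SIS calculations also consider common cause failures and spurious trips.

```python
from math import comb


def prob_at_least_m_of_n(m, n, p):
    """P(at least m of n independent devices respond), each with probability p."""
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(m, n + 1))


# Compare common architectures for a hypothetical per-device probability of 0.9.
for m, n in ((1, 1), (1, 2), (2, 2), (2, 3)):
    print(f"{m}oo{n}", round(prob_at_least_m_of_n(m, n, 0.9), 4))
```

With p = 0.9, a 1oo2 arrangement responds with probability 0.99, 2oo2 with 0.81, and 2oo3 with 0.972, showing the trade-off between acting on demand and tolerating a single faulty device.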
Working age
A measurement in an engineering unit whose value is proportional to the accumulated stress on an asset or work performed by the asset. Calendar time (or, for example, hours on an hour meter) could be used if the asset were operated steadily (a rare situation). Fuel consumed, tons of ore crushed, widgets produced, or rounds fired would often be a more appropriate measure for working age.
In CBM, we may consider the working age as a variable that encompasses the totality of undefined variables that have a significant influence on failure probability. If EXAKT determines that working age bears no influence on failure, it means that the state of the condition monitoring variables found by EXAKT to be significant adequately reflects the true health state of the modeled equipment, component, or failure mode. On the other hand, if working age is determined to be a significant risk variable, it means that the available condition data alone cannot fully represent the item's state. In this case failure is said to be dependent on both the working age and the significant condition indicators. This is the general case. For more information see The elusive PF Interval.