Optimizing a Mine Haul Truck Wheel Motors’ Condition Monitoring Program: Use of Proportional-Hazards Modeling

A.K.S. Jardine*, D. Banjevic*, M. Wiseman§, S. Buck†, T. Joseph*

*Condition-Based Maintenance Laboratory, Department of Mechanical and Industrial Engineering, University of Toronto

§PricewaterhouseCoopers, Toronto

†Cardinal River Coals Ltd., Alberta

Abstract:

The paper discusses work completed at Cardinal River Coals in Canada to improve the existing oil analysis condition-monitoring program being undertaken for wheel motors.

Oil analysis results from a fleet of 55 haul truck wheel motors were analyzed along with their respective failures and repairs over a nine-year period. Detailed data cleaning procedures were applied to prepare data for modeling. In addition, definitions of failure and suspension were clarified depending on equipment condition at replacement. Using the proportional-hazards model (PHM) approach, the key condition variables relating to failures were found from among the 19 elements monitored, plus sediment and viscosity. Those key variables were then incorporated into a decision model that provided an unambiguous and optimal recommendation on whether to continue operating a wheel motor or to remove it for overhaul on the basis of data obtained from an oil sample.

Wheel motor failure implied extensive planetary gear or sun gear damage necessitating the replacement of one or more major internal components in a general overhaul. The decision model, when triggered by incoming data, provided both a recommendation based on an optimal decision policy and an estimate of the unit's remaining useful life (RUL). By optimizing the times of repair as a function of both age and condition data, a potential saving of 20-30% in overhaul costs over existing practice was identified.

Keywords: Wheel motors, Condition monitoring, Oil analysis, Proportional-hazards modeling, Optimizing condition-based maintenance decisions, EXAKT software

Practical Implications:

Current practice for monitoring the health of items is through examining trends in readings obtained from various forms of condition monitoring. Interpretation of these readings is undertaken by an inspector reviewing current and past readings, or through using commercially available trending software. Such an approach does not guarantee that the full information-value contained in the readings is captured. The paper uses a statistical procedure called proportional-hazards modeling to identify the key measurements that should be used to assess the true state of health of the equipment. Economic decision rules are then established. The procedure is described through a case study that reports on the optimization of condition-based maintenance decisions for haul truck wheel motors that are monitored through oil analysis. Application of the procedure demonstrated a 20-30% potential savings in overhaul costs compared to current practice.

This paper underlines the importance of data cleaning and of applying a consistent definition of failure based on both the observed equipment condition at repair time and the inability of the equipment to perform its functions (for additional discussion see Campbell and Jardine 2001).

Introduction

Cardinal River Coals Ltd. is a 50/50 joint venture between Luscar Ltd. and Consol of Canada, Inc. The mine is located approximately 50 km south of Hinton, Alberta on the eastern slopes of the Rocky Mountains. The coal produced from the mine is low sulfur, high quality coking coal used for steel making. Cardinal River Coals Ltd. opened in 1970 as a multiple open pit mine using the truck and shovel mining method. Current annual production at the mine calls for the removal of 21 million cubic metres of rock and 2.8 million tonnes of coal. The mine has won multiple awards for the land reclamation and creating wildlife habitat.

There are 26 haul trucks at the mine site, each having two wheel motors. With 3 spare wheel motors the fleet numbers 55. The existing policy, based on experience, is to rebuild the units after about 20,000 hours of operation. Oil analysis is carried out monthly, and the amount of sediment (weight of filter patch filtrate) and the parts per million (ppm) of five of the nineteen elements are noted: iron, silicon, chromium, nickel, and titanium. The decision to remove the unit for rebuild is based on manual perusal of the values of these elements in combination with the unit's age.

Wheel motor failures relating to the electrical drive elements and braking system were not included in this study since their condition is not reflected by oil analysis data. Seal replacements were carried out frequently as a result of high contamination and coincided with oil change outs. The oil change out event (OC) is considered a “minor” repair. The analysis shows that a high amount of sediment persisting in spite of these corrective measures is associated with a high risk of failure.

Statistical analysis of the CRC wheel-motor data showed a high correlation between iron and silicon. That fact would support the view that there are a high number of failures which are contaminant induced. Hence one may conclude that there is an event or set of conditions that initiate a process of deterioration in the wheel motor. It is assumed that by overhauling the unit before the damage becomes more extensive one would benefit from savings through failure avoidance.

Data Availability

Within the mine’s computer maintenance management system (CMMS) there were histories of wheel motor lifetimes, including details of removals due to failure or preventive maintenance as a result of interpretation of the signals obtained from oil analysis. Costs associated with the failure and preventive removals were also available.

Additionally, there was a database containing a vast history of condition monitoring test results – some 50,000 records.

It may seem that it would be an easy matter to peruse and study these two data sources and learn which patterns of data have been associated with past failure, thus identifying the data combinations that might be employed as condition indicators of future failures. Unfortunately identification of the key condition indicators from amongst all the data collected is seldom obvious to the analyst. The complexity, volume, and time lags within the data render them elusive if not impossible to discern without the proper tools.

In this paper we show a tool that uses a statistical modeling technique known as proportional-hazards modeling to bridge these two invaluable data sources. It is the central function in a program called EXAKT developed precisely for this purpose by the condition based maintenance (CBM) laboratory at the University of Toronto (see Jardine et al, 1997).

Model Building: The Proportional-hazards Model (PHM)

The Weibull-PHM (Equation 1) relates an item’s reliability to its working age and to significant measured variables known as condition monitoring (CM) data.

$$h(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1} \exp\left(\gamma_1 Z_1(t) + \gamma_2 Z_2(t) + \dots + \gamma_m Z_m(t)\right)$$

Equation 1 The proportional hazard model (PHM) with Weibull baseline

In Equation 1, h(t) is the (instantaneous) conditional probability of failure at time t, also known as the hazard function, given the values of Z₁(t), Z₂(t), … Z_m(t). Each Z_i(t) in Equation 1 represents a monitored condition data item at the time of inspection. These measurements may include the parts per million of iron or the vibration amplitude at the second harmonic of shaft rotation, or any variable that reflects the accumulated stress or damage with respect to some mode of failure. These condition data are sometimes called covariates. The γ_i are the covariate parameters weighting the degree of influence each covariate has on the hazard function. The model consists of two parts. The first part is a baseline hazard function (Weibull), (β/η)(t/η)^(β−1), with shape parameter β and scale parameter η, that takes into account the age of the equipment at the time of inspection. The second part, exp(γ₁Z₁(t) + … + γ_m Z_m(t)), takes into account the key variables and their associated weights.
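To make the structure of the Weibull-PHM concrete, the following minimal sketch evaluates a hazard of the standard Weibull proportional-hazards form, with a shape parameter (called `beta` here), a scale parameter (`eta`), and one weight per covariate. The function name and any parameter values passed to it are illustrative assumptions and are not the fitted model of this study.

```python
import math
from typing import Sequence


def weibull_phm_hazard(t: float, beta: float, eta: float,
                       gammas: Sequence[float], z: Sequence[float]) -> float:
    """Hazard = (beta/eta) * (t/eta)**(beta-1) * exp(sum(gamma_i * z_i)).

    t      : working age in hours (must be positive)
    beta   : Weibull shape parameter (assumed name)
    eta    : Weibull scale parameter (assumed name)
    gammas : covariate weights, one per covariate
    z      : covariate values at the inspection (e.g. ppm of iron)
    """
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta and eta must be positive")
    if len(gammas) != len(z):
        raise ValueError("one weight per covariate is required")
    # Age-dependent part: the Weibull baseline hazard.
    baseline = (beta / eta) * (t / eta) ** (beta - 1)
    # Condition-dependent part: weighted sum of covariates (composite covariate).
    composite = sum(g * zi for g, zi in zip(gammas, z))
    return baseline * math.exp(composite)
```

With a positive weight, a higher covariate reading at the same working age yields a proportionally higher hazard, which is the sense in which the hazards are "proportional".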

Data Cleaning Related to Wheel Motor Event Records

The first step in every proportional hazard model (PHM) building exercise is a thorough examination of the data. The data-cleaning phase of PHM is considered the most important one of the entire modeling process. If we are to accomplish our objective of accurate and automated CBM data interpretation, the data upon which we intend to build the model must be as free of error as is possible. Much of this paper, therefore, focuses on the data investigative or cleansing phase of model building.

Fortunately software provides us with ample tools with which to “cleanse” the data. In Figure 1 we show a feature of the EXAKT software (2) that discovers many logical data inconsistencies emanating from the CMMS, thereby helping the analyst to make the corrections that will improve the ultimate model’s precision.

Figure 1 Data checking tool

Data required for PHM analysis consists of "histories" or life-cycles. Each valid history for a wheel motor must have a Beginning event (B), an Ending event (EF for failure, or ES for suspension (such as a preventive removal)) and Inspection events. A discussion of how suspensions and failures were determined is given later in this paper. A history could also have events that are known to affect covariates, such as oil change (OC) events.

The output of the data-checking tool (illustrated in Figure 1) points out probable errors based on a systematic evaluation of working ages and corresponding calendar dates as reported in the CMMS. The software deduces, from the dates and working ages, which inspection and event sets of data comprise individual histories. When it finds a history without an ending reported, it asks whether the ending event should be designated as a suspension (ES), a failure (EF), or whether the life-cycle has not yet ended. In this latter case, the software assigns it a temporary suspension event (denoted by *ES in the software). Temporary suspension means that the item is still operating at the time of the data analysis. Eventually that item will end its life by either an ES or an EF event.

The software also calls the analyst’s attention to anomalies that may indicate data problems or errors. These include two inspections recorded at the same date and time, or working ages and calendar dates that are out of synchronization. For example, an inspection at a later date may have a lower working age than an earlier inspection. This is obviously an error in data transcription which, if allowed to persist in the data, would compromise the model’s accuracy.
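The kind of consistency check described above can be sketched in a few lines. This is not the EXAKT implementation; the `Inspection` record and the `find_anomalies` function are hypothetical names used only to illustrate flagging duplicate timestamps and working ages that decrease as calendar time advances within one history.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Inspection:
    when: datetime      # calendar date and time of the oil sample
    working_age: float  # operating hours of the wheel motor at sampling


def find_anomalies(inspections: List[Inspection]) -> List[str]:
    """Return human-readable descriptions of suspicious inspection records."""
    issues: List[str] = []
    # Compare each record with its predecessor in calendar order.
    ordered = sorted(inspections, key=lambda r: r.when)
    for prev, curr in zip(ordered, ordered[1:]):
        if curr.when == prev.when:
            issues.append(f"duplicate inspection at {curr.when.isoformat()}")
        elif curr.working_age < prev.working_age:
            issues.append(
                f"working age drops from {prev.working_age} to "
                f"{curr.working_age} at {curr.when.isoformat()}"
            )
    return issues
```

Each flagged record would still need to be reviewed by the analyst against the CMMS before any correction is made.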

Most errors can easily be corrected, usually by inserting the missing Beginning and Ending events for each history.

Data Cleaning Related to Wheel Motor Condition Monitoring Records

Oil analysis can be examined graphically using various combinations of covariates, dates, ranges, and scales. Figure 2 reveals suspiciously unusual values of silicon forming a horizontal line at exactly 900 parts per million (PPM). Condition monitoring data rarely behaves so precisely.

Figure 2 Graphical analysis of inspection records

An investigation with the commercial laboratory found that, for a period of time, the photo-multiplier tube on the spectrometer was saturating at exactly 900 PPM. In other words, all values of silicon above 900 were truncated to 900 PPM. A similar situation occurred for iron above 2500 PPM. If not detected, this could play havoc with the PHM.

It was possible to compensate for the laboratory errors in preparing the data used to build the model. For example, the truncated values of Si were replaced with 1.2 × Fe. The factor of 1.2 was determined from the initial slope of the cross graph (a correlation graph) of Fe-Si and from values obtained after the saturation defect was corrected. The truncated Fe values were not corrected since there were too few of them to influence the model.
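The silicon correction can be written as a one-line rule. The sketch below assumes that a truncated reading is recognizable as a value of exactly 900 PPM; the function and constant names are illustrative, and in practice the rule would be restricted to samples from the period in which the instrument was saturating.

```python
SI_SATURATION_PPM = 900.0  # level at which Si readings were truncated
FE_SI_SLOPE = 1.2          # factor taken from the Fe-Si cross graph


def correct_silicon(si_ppm: float, fe_ppm: float) -> float:
    """Replace a saturated Si reading with the estimate 1.2 * Fe.

    Readings that are not at the saturation level are returned unchanged.
    Assumption: a reading of exactly 900 PPM marks a truncated value.
    """
    if si_ppm == SI_SATURATION_PPM:
        return FE_SI_SLOPE * fe_ppm
    return si_ppm
```

The sketch does not guard against the case where 1.2 × Fe falls below 900 PPM, nor against a truncated Fe reading feeding the estimate; both cases would deserve a manual check.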

The correction applied to the Si values is illustrated in Figure 3.

Figure 3 Corrected silicon readings

Cross graphs of pairs of covariates are invaluable in finding correlations that are of great help in developing and evaluating the eventual model.

Figure 4 shows the correlation between iron (Fe) and nickel (Ni). Correlations between other covariates were also tested. For example, Fe vs. Ti, Fe vs. Si and Ni vs. Ti graphs all exhibited similar correlation.

Determining correlation between covariates is useful both to provide insight into the data and to understand the models generated by the software. For example, if Fe and Ni are highly correlated, the modeling process could indicate that there is no point in including nickel in the model since it has been determined to provide no additional information regarding the probability of failure. Such a correlation could be the result of wear of a component whose metallic alloy contains both iron and nickel.

Figure 4 Correlating iron and nickel
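A cross graph gives a visual impression of correlation; a numerical counterpart is the Pearson correlation coefficient. The following sketch computes it for two series of readings taken at the same inspections. The function name is an assumption made for illustration, and the coefficient captures only linear association.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)
```

A coefficient close to 1 for a pair such as Fe and Ni would be consistent with the modeling process retaining only one of the two covariates.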

Data Cleaning Related to Wheel Motor Oil Changes

When building the PHM it is necessary that account be taken of any minor maintenance work, such as changing the oil in the wheel motor. For example, Figure 5 illustrates that the actual transition path of oil measurements was from A to B to C to D. If we did not account for the oil change, then the modeling process would assume that the transition was A to B to D. This would be misleading and would tend to overestimate the risk of failure.

Figure 5 Transition path of oil measurements across an oil change

In the EXAKT data preparation phase, the model is told what covariate values should be associated with those minor corrective events, such as an oil change (OC). Such events could include calibrations or cleaning that would tend to reset certain measurements but not materially zero-time the asset.
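The effect of declaring covariate values for minor events can be sketched as follows. The table `COVARIATES_ON_EVENT` and the function `expand_history` are hypothetical stand-ins, not the EXAKT data structures; the zero values after an oil change are an assumption chosen for illustration. The point is only that the covariate path gains a reset point at each oil change, so the transition is read as A to B to C to D rather than A to B to D.

```python
from typing import Dict, List, Tuple

# Hypothetical table: covariate values assumed immediately after each minor event.
COVARIATES_ON_EVENT: Dict[str, Dict[str, float]] = {
    "OC": {"Fe": 0.0, "Sed": 0.0},  # oil change: readings assumed to reset
}

# (working age in hours, event code, measured covariates; empty for non-inspections)
Record = Tuple[float, str, Dict[str, float]]


def expand_history(events: List[Record]) -> List[Tuple[float, Dict[str, float]]]:
    """Return the covariate path, adding a reset point after each minor event."""
    path: List[Tuple[float, Dict[str, float]]] = []
    for age, code, covariates in sorted(events, key=lambda e: e[0]):
        if code == "I":
            # Inspection: keep the measured values.
            path.append((age, dict(covariates)))
        elif code in COVARIATES_ON_EVENT:
            # Minor event: insert the values declared for that event type.
            path.append((age, dict(COVARIATES_ON_EVENT[code])))
    return path
```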

Figure 6 shows ‘missing’ or ‘irregular’ oil changes and obvious gaps. Calculations indicate oil ages of 7000-8000 hours. Such extended oil usage is unlikely given the use of mineral oils in this application. Obviously oil change events occurred but were unreported in the CMMS. The site did change to synthetics recently (1997) in order to eliminate the need for regular oil changes. However most of the data covers the period prior to 1997. It was thought that this information needed to be recovered from the commercial laboratory’s files. Unfortunately these files, too, were incomplete and inconsistent with the dates and working ages in the work order database.

Figure 6 Missing oil change events

Happily, it was determined that these 'missing' oil changes did not significantly affect the model since they were relatively few in number (with respect to all of the known oil changes). That is, there were a sufficient number of known oil changes in the database for the model to account for their effect on the measured data.

Building the Proportional Hazards Model (PHM)

After all the obvious data errors were eliminated or corrected, the proportional hazard model could be generated by the EXAKT software. As illustrated in Equation 1, the hazard function is the short term probability of failure of an unfailed unit at a given point in time. It is a function of both working age and the “significant” condition data. An iterative procedure in the EXAKT software removes the insignificant covariates and weights the significance of those remaining covariates found to influence the hazard. A number of candidate models are then tested to see how well they represent the actual data. One of the test methods used by the software is known as residual graphical analysis, illustrated in Figure 7.

Figure 7 Residuals Graph

Each point on the residual graph of Figure 7 represents a history (the time from wheel motor installation to its removal for whatever reason). The sample used to build the model consists of many histories. The graph shows an unusual point that is well above the 95% upper limit. This leads one to investigate the underlying data corresponding to this residual. It was discovered that some ‘unusual’ data were included in that history which appear to violate the model we are attempting to build.

The unusual residual value was identified as corresponding to one particular history from wheel-motor 5509R, with beginning event at 48900 hrs and EF (ending with failure) event at 72005 hrs.

The ‘offending’ data is shown in Figure 8.

Figure 8 Investigating the strangeness

The Fe values in the left-circled region of Figure 8 have an inexplicable pattern. Fe jumps to high values (but truncated at 2500 PPM due to instrument saturation) and remains in the same range for a few more inspections. Then, the readings fall back to low values. No failure or maintenance events were recorded to explain these sudden jumps.

Having no event data to support such high values of Fe and Si, the offending history was removed from the working data set and the model was regenerated. This time, statistical goodness-of-fit testing procedures and graphical residuals analysis indicated immediate improvement of the model’s fit to the data. The model no longer had to accommodate obviously contradictory and misleading information.

The failure definition problem

The foregoing describes data problems which were encountered and which were relatively easy to correct using the statistical and graphical tools available in the software.

However, a different (and more fundamental) problem occurred regarding the definition of wheel motor failure. These units seldom failed “functionally”. There were few “in-service” failures requiring that a haul truck be removed immediately from operation. Nevertheless, a predictive model requires an objective determination that a unit had failed. It was, therefore, necessary to scrutinize the past work order records to distinguish failures from preventive removals. Initially, the tradesman's remarks were used for this purpose, such as "High iron in oil sample and high hours, removed and replaced wheel motor." Such an event was then classified as a “failure”. However, on reviewing the re-builder's report attached to each invoice, it became clear that some events initially classified as failures should be treated as suspensions (preventive repairs). For example, if the gears had been replaced because they failed an ultrasonic test or they were obviously in a failed state, then that life-cycle’s ending event should be classed as a failure. But if gears or other major components were replaced simply because it was expedient to do so, or if the unit was only generally rebuilt with no real internal damage (or major expense), then that history’s ending event should be considered a suspension.

With the definition of suspension and of failure thus clarified, a proportional-hazards model was re-built in the software and found to be a “good fit”.

The retained PHM

The model containing the covariates iron and sediments was found to be good, both by graphical residual analysis and by the Kolmogorov-Smirnov statistical test applied automatically by the software. The results of the analysis are displayed in Figure 9. Covariate significance is tested by the Wald statistic, the square of the standardized estimate of the parameter which follows a chi square distribution with 1 degree of freedom. (Note: A few missing sediment values had been replaced by the values from previous inspections prior to the analysis, hence the reason for using the notation CorrSed in Fig. 9). The PHM is thus:

Figure 9 The Proportional Hazards Model
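The Wald test mentioned above can be illustrated with a short calculation. The sketch assumes a parameter estimate and its standard error are available from the fitting step; the function name is hypothetical and no values from Figure 9 are reproduced. For a chi-square distribution with 1 degree of freedom, the upper-tail probability of W equals erfc(sqrt(W/2)).

```python
import math
from typing import Tuple


def wald_test(estimate: float, std_error: float) -> Tuple[float, float]:
    """Return the Wald statistic and its p-value (chi-square, 1 degree of freedom)."""
    if std_error <= 0:
        raise ValueError("standard error must be positive")
    # Square of the standardized parameter estimate.
    w = (estimate / std_error) ** 2
    # Upper tail of chi-square(1): P(X > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value
```

A covariate would typically be retained when its p-value falls below the chosen significance level, for example 0.05.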

Obtaining the Decision Model

After determining the PHM we next proceed to establish the optimal decision model (see Jardine et al 1997) that incorporates economic considerations along with the risk estimate obtained from the PHM. The decision policy was calculated with a cost ratio of 3:1 ($20K for preventive replacement cost, $60K for failure replacement cost, based on the invoices of past repairs by outside contractors; see Figure 10). The cost ratio (between the cost of a failure and that of a preventive repair) will vary amongst applications and even in the same application at different times. Failure cost could include the costs associated with operational consequences depending on current production and coal market conditions.

Figure 10 Decision Model Graph
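The economic criterion behind such a decision policy is the long-run expected cost per unit of working age. As a hedged illustration, the sketch below evaluates that cost rate for a given policy from the two replacement costs, the fraction of replacements that end in failure, and the mean time between replacements. The function name is an assumption; the failure fraction and mean life would come from the fitted model and are not taken from this study.

```python
def cost_per_hour(c_preventive: float, c_failure: float,
                  failure_fraction: float, mean_life_hours: float) -> float:
    """Long-run expected cost per operating hour under a replacement policy.

    c_preventive     : cost of a preventive replacement (e.g. 20000)
    c_failure        : cost of a failure replacement (e.g. 60000)
    failure_fraction : probability that a life-cycle ends in failure (0..1)
    mean_life_hours  : expected working age between replacements
    """
    if not 0.0 <= failure_fraction <= 1.0:
        raise ValueError("failure_fraction must lie between 0 and 1")
    if mean_life_hours <= 0:
        raise ValueError("mean_life_hours must be positive")
    # Expected cost of one replacement cycle, weighted by how the cycle ends.
    expected_cycle_cost = (c_preventive * (1.0 - failure_fraction)
                           + c_failure * failure_fraction)
    return expected_cycle_cost / mean_life_hours
```

Comparing this cost rate across candidate policies, each with its own failure fraction and mean life, is what allows a "replace only on failure" policy to be set against a condition-based one.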

The model includes the effects of regular maintenance intervals (oil changes) at 500 hrs that occurred regularly during most of the histories prior to the changeover to synthetic oil. A model calculated without including the oil change intervals would tend to underestimate predicted failure times. Oil change events are accounted for by defining (in a database table called the CovariatesOnEvent table) the values the covariates would attain were an inspection carried out immediately following the (minor repair) event. The PHM parameters and transition probabilities are then estimated from this adjusted data. In subsequent calculations we have to take into account that covariates will regularly have their values reduced. Otherwise covariates will reach high values in the calculation process faster than in real data, and thus produce a higher estimated risk of failure than is really the case.

At present the model does not attempt to optimize inspection frequency (a future research feature). Nevertheless, by inspection of the current data on the decision model graph, one would likely choose to increase inspection frequency as the composite covariate (the weighted sum of those variables found to be significant risk factors) approached the boundary condition.

It is to be noted (in calculating the benefits of the optimizing model) that no operational savings were accounted for. This was due to the unfavorable coal market conditions at the time, causing the mine to operate below its capacity. As market conditions improve higher cost ratios will be used. Current strip ratios (total material removed versus sellable material) would also affect the savings associated with increased availability and reliability of each haul truck unit. Figure 11 demonstrates the sensitivity of the overall savings to changes or inaccuracies in the cost ratio.

Figure 11 Cost Sensitivity Testing

In real situations, the actual ratio of failure and preventive replacement costs may not be well known. Furthermore the dynamics of industry are such that costs can change with changing technology, production, and market conditions. Therefore one should consider how the true total cost (of failure and preventive maintenance when applying the optimal CBM interpretation policy of Figure 10) would change with changes in cost ratio. The software enables sensitivity analysis to be undertaken and generates a graph (Figure 11) and corresponding tabular data.

The curves on the graph are interpreted as follows.

Solid Line: If the actual cost ratio (CR) differs today from that specified when the model was built, that means that the current policy (as dictated by the Optimal Replacement Graph of Figure 10) may no longer be optimal. The line indicates the cost percentage increase that will be incurred (above the optimal cost/unit time for the actual cost ratio) by applying the “optimal” policy (which may no longer be optimal). For example, if the actual cost ratio is 5 and we are using a model which is based on CR=3, then the increase in the cost incurred by following that (now non-optimal) policy is around 6% (5.98). In other words the solid line represents the sensitivity of costs to changes in CR.

Dashed Line: Again, assume the actual cost ratio has strayed from what was used when the model was built. If the model is rebuilt using the new ratio the dashed line tells how much the new optimal cost would differ from that of the original model. (Note that the sensitivity graphs assume that only C_f (failure replacement cost) changes and C_r (preventive replacement cost) remains unchanged.) In other words the dashed line represents the sensitivity of the optimal policy to changes in CR. The graph indicates that moderate overestimation of the cost ratio does not significantly affect the average long run cost but provides a more conservative policy from the point of view of risk of failure.

Figure 12 Savings predicted for different economic conditions

The cost analysis summary shown on Figure 12 indicates a saving of 25%, when CR=3, over the “replace only on failure” policy. Those costs were found to be equivalent to the site’s actual past policy. Hence, no real cost savings were derived from the non-optimal oil analysis program that was in place.

Decision model results are also calculated for cost ratios of 5 and 6. As the cost ratio increases we can observe an increase in both the optimal policy cost and the savings. The optimal decision models in these cases indicate that more frequent preventive replacements (from 74% to 91%) will result from applying the optimal decision policy. However, those preventive actions will avoid a higher proportion of costs due to failure.

Applying the Decision Model

Once the decision model (of Figure 10) was built, previous data was examined in order to see what the decision model would have recommended prior to actual failure and preventive action. One illustration of such a history is shown in Figure 13. This graph provides a recommended decision based on inspection data (covariates and working age).

Figure 13 Retroactive application of the decision model

The decision ‘Replace immediately’ was first suggested by the model at working age = 11384 hrs, 276 hours (about two weeks) prior to failure (reported at 11660 hrs). The following inspection (at working age = 11653 hours), 7 hours prior to failure, also recommends the replacement of the wheel motor. The first warning may have been sufficient, given a sample turnaround time of 48 hours, to prevent the consequences (high repair cost) of failure. Even prior to 11384 hours it can be seen from the decision graph that the results of the measurements indicate that a replacement recommendation was imminent. The zero points on the graph indicate default measurement values of zero (imputed by the software) immediately following oil changes.

An economic benefit is associated with basing the maintenance policy on the Decision Policy Graph (of Figure 10): this investigation indicates a potential saving of 20-30% compared to the current practice.

Conclusion

The case study first demonstrates the value of using the technique of PHM to assist maintenance professionals to interpret condition data by identifying the key risk factors and their relative influence on the health of equipment in general, and wheel motors in particular. Economic considerations were then blended with the PHM risk model to identify the optimal decision chart. The study then indicates that the implementation of a condition-based maintenance strategy based on the decision chart would result in a cost reduction of 20%-30%.

References

Cox, D.R. (1972) “Regression models and life tables (with discussion)”, J. Roy. Stat. Soc. B, Vol. 34, pp. 187-220.

Jardine, A.K.S., Banjevic, D. and Makis, V. (1997) “Optimal replacement policy and the structure of software for condition-based maintenance”, Journal of Quality in Maintenance Engineering, Vol. 3, No. 2, pp. 109-119.

Campbell, J.D. and Jardine A.K.S. (Editors), (2001) Maintenance Excellence: Optimizing Equipment Life-Cycle Decisions, Marcel Dekker, (Chapter 12: Optimizing Condition Based Maintenance, by M. Wiseman).