Assessing the Cost of Unreliability in a Gas Plant to Have a Sustainable Operation – Part 2/2
With ever-increasing worldwide pressure, it is essential that gas and petrochemical plants operate with high reliability and safety to maximize the return obtained on capital investment. For business, the financial issue of reliability is controlling the cost of unreliability (COUR) from equipment and process failures that waste money. This is the second part of an article, the first part of which was published in MaintWorld issue 4/2013.
THE FIRST part of the article introduced the Cost of Unreliability (COUR) methodology and used an NLG gas plant as an example.

4) Equipment Reliability Analysis: Once a big-picture view of a gas plant has been acquired, a more detailed analysis can be performed to make the best use of reliability engineering resources. Critical equipment with high maintenance costs and a high failure frequency is selected (from the Pareto chart) to investigate what is going wrong. For these important pieces of equipment in each process area (in terms of money and failures), a reliability engineering analysis should be developed. One of the most important statistical tools for this application is Weibull analysis. The Weibull life distribution was developed by Waloddi Weibull of Sweden to investigate metal fatigue failures. It is a versatile distribution that can take on many shapes, and the characteristics of other types of distribution, depending on the value of the shape parameter beta (β). In fact, by varying the value of beta, all phases of the bathtub curve can be modelled. This makes Weibull analysis a powerful mathematical tool for life data analysis of process equipment, in which the practitioner attempts to predict the life of items from failure data. The Weibull distribution is one of the most widely used lifetime distributions in reliability engineering. The Weibull reliability function is given by equation (3):
$$R(t) = e^{-(t/\eta)^{\beta}}$$
(3)

where:
- R(t): Reliability at mission time t
- t: Mission time (hours, months, years etc.)
- η: Characteristic life (hours, months, years etc.)
- β: Shape parameter

a) Advantages of Weibull Analysis: It provides reasonably accurate failure analysis and failure forecasts with extremely small samples, so solutions are possible at the earliest indication of a problem. The Weibull data plot is particularly informative (see figure 5). The horizontal scale is a measure of life or aging (start/stop cycles, hours, miles, landings etc.). The vertical scale is the cumulative percentage failed. The two defining parameters of the Weibull line are the slope, beta (β), and the characteristic life, eta (η). The slope of the line is particularly significant and may provide a clue as to the physics of the failure. The characteristic life is the typical time to failure in Weibull analysis and is related to the Mean Time to Failure (MTTF).

b) Failure distribution: The slope beta indicates which class of failure is present.
- β < 1 indicates infant mortality
- β = 1 indicates random failures (independent of age)
- β > 1 indicates wear-out failures

The Weibull plot shows the onset of the failure. For example, it may be of interest to determine the time at which 1% of the population will have failed. This is called the “B1” life. From the example curve, B1 = 150 hrs. The horizontal scale shows the age to failure and the vertical scale shows the Cumulative Distribution Function (CDF), which describes the percentage that will fail at any age. The complement of the CDF (100 - CDF) is the reliability. The characteristic life (η) is defined as the age at which 63.2% of units will have failed, called the “B63.2” life. In the special case where β = 1 (the Weibull distribution reduces to the exponential distribution), the Mean Time To Failure (MTTF or MTBF) and η are equal.
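The quantities described above can be sketched in a few lines of Python. This is only an illustration: the function names and the parameter values (η = 1000 hours, β = 2) are assumptions chosen for the example and are not taken from the case study.

```python
import math

def weibull_reliability(t, eta, beta):
    """R(t) = exp(-(t/eta)**beta): probability of surviving to age t."""
    return math.exp(-((t / eta) ** beta))

def b_life(percent_failed, eta, beta):
    """Age at which `percent_failed` percent of units have failed (1 -> B1 life)."""
    fraction = percent_failed / 100.0
    return eta * (-math.log(1.0 - fraction)) ** (1.0 / beta)

def mttf(eta, beta):
    """Mean time to failure of a two-parameter Weibull: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)

# Hypothetical parameters, for illustration only.
ETA = 1000.0   # characteristic life in hours (assumed)
BETA = 2.0     # shape parameter (assumed; beta > 1 means wear-out)

# Fraction failed at t = eta is 1 - 1/e, i.e. about 63.2%, whatever beta is.
unreliability_at_eta = 1.0 - weibull_reliability(ETA, ETA, BETA)

# B1 life: the age by which 1% of the population is expected to have failed.
b1 = b_life(1.0, ETA, BETA)

print(f"F(eta) = {unreliability_at_eta:.3f}, B1 = {b1:.1f} h, MTTF = {mttf(ETA, BETA):.1f} h")
```

Calling `mttf(eta, 1.0)` returns `eta` itself, which is the β = 1 special case mentioned in the text.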
Fig 6. Failure of electrical fuse.
Fig 11. Electric motor’s failed parts: stator, rotor, bearing race, bearing cage, ball bearing.

c) Maintenance decision: The Weibull plot is extremely useful for maintenance planning, particularly RCM (Reliability Centred Maintenance). The beta (β) value tells the analyst whether or not scheduled inspections or overhauls are needed. If β ≤ 1, overhauls are not cost effective. With β > 1, the overhaul period or scheduled inspection interval is read directly from the plot at an acceptable probability of failure. For wear-out failure modes, if the cost of an unplanned failure is greater than the cost of a planned replacement, there is an optimum replacement interval at minimum cost.
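The reasoning behind this rule can be illustrated with the Weibull hazard (instantaneous failure rate), which falls with age when β < 1, stays constant when β = 1 and rises when β > 1. The sketch below uses the standard two-parameter Weibull hazard formula; the function names are assumptions made for this example.

```python
def weibull_hazard(t, eta, beta):
    """Instantaneous failure rate h(t) = (beta/eta) * (t/eta)**(beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

def failure_class(beta, tol=1e-9):
    """Map the Weibull slope to the failure class it indicates."""
    if abs(beta - 1.0) <= tol:
        return "random"
    return "infant mortality" if beta < 1.0 else "wear-out"

def overhaul_can_be_cost_effective(beta):
    """Scheduled overhaul is only worth evaluating when the failure rate rises with age."""
    return failure_class(beta) == "wear-out"

# Compare the hazard at two ages for each class (eta = 7 is an arbitrary example value).
for beta in (0.5, 1.0, 2.6):
    early, late = weibull_hazard(2.0, 7.0, beta), weibull_hazard(4.0, 7.0, beta)
    print(beta, failure_class(beta), f"h(2)={early:.4f}", f"h(4)={late:.4f}")
```

An old unit is more likely to fail than a new one only in the wear-out case, which is why replacing a healthy component on a schedule cannot pay off when β ≤ 1.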
Fig 5. Weibull Probability Plot.

d) Real Case-Problem description: In a large oilfield in South America, numerous recurring failures were detected during infrared thermography inspection of the fuses of the main isolator switches, which supply power to the oil well pumps. A reliability study was required to determine the best maintenance inspection frequency, because a loss of power would have a big impact on the oilfield’s business.

One failure mode and mechanism was identified through the Failure Analysis process. The physical root cause of the fuse failures was environmental corrosion caused by H2S in the atmosphere. Weibull analysis was selected as the best tool to determine the optimum inspection frequency and, through the resulting beta value, to corroborate the failure mode investigated in the field. Figure 6 shows the thermography inspection and the physical failure mode of an electrical fuse respectively.

Life data analysis was performed by applying the Weibull distribution. Table I shows the electrical fuse failure data.
In this study a typical commercial software package was used to analyze the data shown in Table I. The best-fit distribution was the two-parameter Weibull. Table II shows the resulting Weibull parameters.
The beta value of 2.6 indicates that the main failure mode is wear-out, which is consistent with corrosion being the main damage mechanism, as corroborated in the field (see Fig. 6). Since a beta value larger than one indicates a wear-out failure mode, a replacement policy should be analyzed. Figure 7 shows the reliability curve for the electrical fuses studied.
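The article does not name the software package or the fitting method it used. One common way to obtain the two Weibull parameters from a complete set of failure times is median rank regression, sketched below with Benard's approximation for the median ranks. The failure times in the example are hypothetical and are not the data of Table I.

```python
import math

def median_rank_regression(failure_times):
    """Estimate (beta, eta) by a least-squares line on the Weibull probability plot.

    Uses the linearization ln(-ln(1 - F)) = beta * ln(t) - beta * ln(eta),
    regressing the transformed median ranks on ln(t). Assumes complete data
    (no suspensions) and at least two distinct failure times.
    """
    times = sorted(failure_times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        median_rank = (i - 0.3) / (n + 0.4)   # Benard's approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - median_rank)))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta = sxy / sxx                          # slope of the fitted line
    intercept = y_mean - beta * x_mean        # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta

# Hypothetical times to failure in months, for illustration only.
example_times = [2.1, 3.4, 4.6, 5.2, 6.0, 6.8, 7.5, 8.9, 10.2]
beta_hat, eta_hat = median_rank_regression(example_times)
print(f"beta = {beta_hat:.2f}, eta = {eta_hat:.2f} months")
```

Commercial packages may instead regress ln(t) on the transformed ranks or use maximum likelihood, so their estimates can differ slightly from this sketch.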
Fig 7. Electrical fuses reliability curve.

Component replacement policy option: A common problem for maintenance managers is to determine which policy to adopt for the replacement of components that do, or may, fail. The current case of the electrical fuses is typical of one of the most common forms of the problem encountered in practice. The appropriate policy will depend on such factors as:
- The reliability of the component as a function of operating life and in particular, whether wear out occurs.
- The costs arising if we need to replace a component at an inconvenient time as the result of actual failure, or of the detection of an imminent failure condition.
- The costs associated with the replacement of the component before failure, at a convenient time, for example at a routine maintenance time.
For this analysis of the electrical fuses problem, two types of situation in which component replacements occur have been considered.

(a) Failure Replacement: A failure replacement is a replacement that occurs following the failure of a component in service, or following the identification of an unfavourable condition that leads us to replace the component promptly, within a short time of the condition being detected.

(b) Preventative Replacement: A preventative replacement is the replacement of a component that has not failed.

Preventative replacement can only be worthwhile if two conditions hold. The first is that the failure rate of the component is increasing, or will increase before another preventative replacement opportunity occurs. The second is that the cost of failure replacement is greater than the cost of preventative replacement. Preventative replacement is not appropriate if the failure rate (hazard function) is decreasing or constant (beta < 1 or beta = 1). For this reason a life distribution analysis of the electrical fuses was carried out first, to confirm that wear-out was occurring. Even if wear-out occurs, the choice of policy also depends on the cost of preventative replacement being less than the cost of failure replacement. Preventative replacement policies result in the loss of useful life of the components that are removed before failure. For preventative replacement to be worthwhile, the cost savings resulting from fewer failure replacements must more than compensate for this loss. This can only occur if failure replacements are expensive compared with preventative replacements. The cost of making a preventative replacement is usually less than the cost of failure replacement because preventative replacements can be scheduled so as to avoid loss of production. Also, if preventative replacement is carried out as part of a routine service or overhaul, the repair cost tends to be reduced because the replacement can be done as part of the other work.
Fig 12. Bearing cage fractography analysis.
Fig 9. Risk versus time.

In this case study, the important factors were (a) the cost of lost production resulting from a catastrophic electrical fuse failure and (b) the preventative replacement cost. The first is called the “unplanned replacement cost” (Cf), which includes the cost of the replacement component, the cost of labour and related overheads, an allowance for a percentage of overtime working and the cost of lost production in an average case. For this case study Cf is equal to $100,000. The second is called the “preventative replacement cost” (Cp), which consists of the cost of the replacement component and the cost of labour and related overheads. In the present case, this was estimated at $5,000.

The cheapest age-based preventative replacement policy is the one that has the lowest long-run cost per unit time. This cost is the ratio of the average replacement cost per component to the average life per component. The optimum replacement frequency is calculated with equation 4: it is the frequency that minimizes the value of C(t).
$$C(t) = \frac{C_p \, R(t) + C_f \, F(t)}{\int_0^t R(s)\, ds}$$
(4)

where:
- C(t): Cost per unit time ($/month)
- R(t): Reliability value of component at time t
- F(t): Unreliability value of component at time t
- Cp: Cost of preventative replacement ($)
- Cf: Cost of unplanned replacement ($)

Figure 8 shows the optimum replacement curve for this case study. The minimum replacement cost occurs at two months, with a cost per unit time of $4,127 per month, but a fuse replacement interval of four months was selected, with a cost per unit time of $6,000 per month. The cost per unit time of replacement at failure is $16,000 per month.
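The search for the cheapest replacement age can be sketched numerically using the standard age-replacement cost model: expected cost per cycle divided by expected cycle length, where the cycle length is the integral of R from 0 to t. Cp, Cf and β = 2.6 come from the case study. The characteristic life of 7 months is an assumption, since Table II is not reproduced here, so the output will be in the same range as the quoted figures but will not match them exactly.

```python
import math

CP = 5_000.0     # preventative replacement cost ($), from the case study
CF = 100_000.0   # unplanned replacement cost ($), from the case study
BETA = 2.6       # Weibull shape parameter reported in the article
ETA = 7.0        # ASSUMPTION: characteristic life in months (not given in the text)

def reliability(t):
    """Two-parameter Weibull reliability R(t)."""
    return math.exp(-((t / ETA) ** BETA))

def expected_cycle_length(t, steps=2000):
    """Trapezoidal estimate of the integral of R(s) from 0 to t (mean life per cycle)."""
    h = t / steps
    total = 0.5 * (reliability(0.0) + reliability(t))
    total += sum(reliability(k * h) for k in range(1, steps))
    return total * h

def cost_per_unit_time(t):
    """Long-run cost rate ($/month) when replacing at age t or at failure, whichever comes first."""
    r = reliability(t)
    return (CP * r + CF * (1.0 - r)) / expected_cycle_length(t)

def run_to_failure_cost():
    """Cost rate with no preventative replacement: Cf divided by the MTTF."""
    return CF / (ETA * math.gamma(1.0 + 1.0 / BETA))

def optimum_interval(t_min=0.25, t_max=12.0, step=0.05):
    """Grid search for the replacement age with the lowest cost rate."""
    count = int(round((t_max - t_min) / step)) + 1
    candidates = [t_min + k * step for k in range(count)]
    return min(candidates, key=cost_per_unit_time)

best_t = optimum_interval()
print(f"optimum interval ~ {best_t:.2f} months, C = {cost_per_unit_time(best_t):,.0f} $/month")
print(f"C(4 months) = {cost_per_unit_time(4.0):,.0f} $/month")
print(f"run to failure = {run_to_failure_cost():,.0f} $/month")
```

A grid search is used instead of a closed-form solution because the integral of the Weibull reliability has no elementary antiderivative for general β.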
Fig 8. Cost per unit time versus replacement time curve.

The optimum saving is $11,873 per month ($16,000 per month - $4,127 per month). The fuse replacement policy selected was replacement at four months, so the real money saved was $10,000 per month ($16,000 per month - $6,000 per month).

The business risk of making a preventative replacement at four months is $19,000. A tolerable risk has been set at $20,000 per month, meaning that preventative replacement at four months is acceptable. Figure 9 shows the business risk through time. This kind of tool helps managers make the right decision based on risk.

5) Root Cause Failure Analysis: The failure analysis process is a key area for any organization when assessing the cost of unreliability. The purpose of this step is to identify the main cause of failure in order to avoid the recurrence of the problem. If the main causes of failures are not addressed, the annual cost of unreliability will keep growing year by year.

Failure analysis is a multidisciplinary (electrical, mechanical, electronic, hydraulic) activity. A single analyst may not have the knowledge of disciplines such as metallurgy, materials science, corrosion, structural mechanics, aerodynamics and so on that may be needed to analyze a particular failure or accident, so a team effort is always desirable to arrive at a correct solution to the problem. The RCFA process requires the right information to make a good engineering analysis. The levels of information are:

a. Background information: Whenever a failure analysis is to be carried out, it is essential at the outset to collect the relevant background information. This facilitates the development of a complete case history of the failure. The information to be collected falls into two groups: information about the failed component and information about the failure itself.

b. Information about the failed component includes: name of the component, identification number, manufacturer and user, location, intended function, service life since new or last overhaul, design loads, actual service loads and load orientation, frequency of loading, service parameters such as temperature, pressure, current, tension and environment, materials of construction, specifications and codes, fabrication process, thermomechanical treatments, inspection techniques and records, and maintenance records.

c. Information about the failure itself includes: date and time of failure, extent of the damage, operating conditions immediately prior to the failure, and service abnormalities.

The most important task in any failure analysis is the consolidation and systematic connection of all data obtained during the analysis. For effective communication of the results of failure analysis, documentation is extremely important. The report should be clear and contain the logic behind the conclusions. It is extremely important for the investigation team to keep track of the follow-up action taken on its recommendations. Continued interaction with the end users, manufacturer and operator is highly desirable. Figure 10 shows the entire process for performing a good Root Cause Failure Analysis (RCFA).
Fig 10. RCFA Process.

One of the most important and quite common issues in Oil & Gas companies is that failure data is not analyzed in a scientific manner. The end users or OEMs acquiring the data rarely realise that it can be used to solve their problems. Most chemical, gas and petrochemical plants hold large amounts of data in their CMMS or other databases. For this reason organizations should start to review their own data and turn it into valuable information in order to make the right decisions.

Real Case-Problem description: An electric motor (300 kW) from a screw compressor that provides dry air at 7 bar to the plant’s instruments was taken out of service for major maintenance at 40,000 hours of operation (winding inspection and other electrical testing). The opportunity was used to replace the roller bearing. After almost 5,000 operating hours since that inspection, the motor suffered a catastrophic failure. Due to the high consequence of failure and the criticality of the equipment, a Failure Analysis was required to discover what happened and to avoid recurrence of the problem. Figure 11 shows the electric motor’s failed parts.

Figure 11 shows internal wear damage to the stator, and that the rotor was out of alignment with the motor centre-line during the failure process. This is a clear sign that the stator and rotor were in contact during the failure. The image of the inner bearing race shows considerable plastic deformation; the bearing race was found welded to the shaft due to the high temperature developed during the failure process. The Failure Analysis was performed by studying seven possible causes: design fault, material defect, process or manufacturing defect, maintenance or assembly fault, service conditions outside specification, maintenance defects (ignoring procedures) and operation fault.

After a deeper failure investigation that included metallographic and fractographic analysis of the failed electric motor parts (bearing race, ball bearing and bearing cage), the failure analysis concluded that:
- The electric motor failed due to a bearing cage failure. The bearing cage failed because of a fatigue process. Figure 12 shows, in boxes a, b, c and d, clear beach marks, indicating that a fatigue process occurred in the bearing cage near the rivet hole.
- The ball bearing cage failed because good practices for bearing handling and mounting were missing or wrongly applied.
- Cage failure is an unpredictable and undetectable bearing failure mode. This failure mode cannot be detected with vibration analysis.

Recommendations were:
- Best practices for bearing handling and mounting should be supplied to subcontractors.
- A temperature sensor and a vibration sensor should be located on the bearing housing to avoid catastrophic failure.
- Lessons should be learned and communicated to the industry.
6) Improvement Actions: In this phase, actions for plant performance improvement shall be selected and implemented for those areas of the process plant that are responsible for diminishing the business gross margin. These actions should be selected based on the information collected during the RBD, Weibull and RCFA analyses, and shall be prioritized by risk using the business risk matrix shown in figure 3. For this reason, during this critical stage of the analysis top management should be engaged in making the right decisions and setting investment and improvement priorities for the equipment that generates the largest risks to the business. Figure 13 shows the improvement priorities based on business risk: the equipment inside the red boxes shall be attended to first, then the equipment inside the orange boxes, then the equipment within the yellow boxes and finally the equipment within the green boxes. Top management needs to understand that a business risk of zero does not exist. There is always the possibility of suffering a failure, but focusing the maintenance and reliability strategy on risk management will have a very positive impact on company revenues. An environment with fewer failures means a positive environment to work in, without pressure, with a lower probability of accidents, more time to think proactively instead of thinking about how to repair, and more time to make the process plant more profitable.
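The colour-based ordering described above amounts to a simple sort on the risk band. The sketch below shows one way to build such a worklist. The equipment tags and their bands are hypothetical and do not come from the gas plant in the article.

```python
# Priority of each band in the business risk matrix: lower number = attend to first.
RISK_BAND_ORDER = {"red": 0, "orange": 1, "yellow": 2, "green": 3}

def prioritize(equipment):
    """Return (tag, band) pairs ordered red, orange, yellow, green.

    sorted() is stable, so items in the same band keep their input order.
    """
    return sorted(equipment, key=lambda item: RISK_BAND_ORDER[item[1]])

# Hypothetical equipment list, for illustration only.
equipment = [
    ("cooling water pump", "yellow"),
    ("instrument air compressor", "red"),
    ("flare knock-out drum", "green"),
    ("main isolator fuses", "orange"),
    ("inlet separator", "red"),
]

for tag, band in prioritize(equipment):
    print(f"{band:<7} {tag}")
```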
Fig 13. Business Risk Matrix for management decision.
Conclusions
This article has described a detailed procedure to review the Cost of Unreliability value for a typical gas plant. The gas plant data has been explored using new and different tools to solve reliability problems and to help managers make the right decisions from the business point of view, with risk management supporting the decision. Real cases were shown to illustrate how this methodology can be applied with effective results.

In the Oil & Gas industry it is quite common to find high availability numbers but low reliability figures. For this reason most companies do not notice the money lost due to unreliability (high-frequency failures). The cost of unreliability index is a simple and practical reliability tool for converting failure data into cost, helping managers and the entire organization to understand the problem by putting it in writing. Unreliability cost will increase year by year unless a reliability programme is put in place with top management commitment.

This paper shows the importance of applying Reliability Engineering to the gas process plant. End users and OEMs that start to implement these tools and methodologies at their sites will gain several advantages over their competitors and will have a sustainable operation.