This week’s Pipeliners Podcast episode features Doug Rothenberg returning to the podcast for Part 2 of a discussion about alarm management and pipeline safety, this time focusing on alarm rationalization.
In this episode, you will learn about how to efficiently perform or support alarm rationalization to create value for your pipeline operation, the importance of gathering together controllers and supervisors to identify the best approach to rationalize the alarms, how automation can be used to optimize controller response to alarms, and more pipeline safety topics.
Alarm Rationalization: Show Notes, Links, and Insider Terms
- Doug Rothenberg is the President and Principal Consultant of D-RoTH, Inc., a technology consulting company providing innovative technology and services for industry. Doug’s specialty is control alarm management training and consulting for the industrial process industries. Find and connect with Doug on LinkedIn.
- Get Doug Rothenberg’s book, “Situation Management for Process Control,” as discussed in the podcast here.
- Alarm rationalization is a component of the Alarm Management process of analyzing configured alarms to determine causes and consequences so that alarm priorities can be determined to adhere to API 1167. Additionally, this information is documented and made available to the controller to improve responses to uncommon alarm conditions.
- An alarm management program is a method to manage the alarm rationalization process in a pipeline control room.
- SCADA (Supervisory Control and Data Acquisition) is a system of software and technology that allows pipeliners to control processes locally or at remote locations. SCADA breaks down into two key functions: supervisory control and data acquisition. Included is managing the field, communication, and control room technology components that send and receive valuable data, allowing users to respond to the data.
- Distributed Control System (DCS) is an automated control system typically used in the processing industry that includes controllers spread out across a plant or control area. There is typically no central operator supervisory control.
- HSE refers to the focus on Health, Safety, and Environment issues in the area of pipeline safety and integrity management.
- AOCs (Abnormal Operating Conditions) are defined by PHMSA under 49 CFR 195.503 and 192.803 as a condition identified by a pipeline operator that may indicate a malfunction of a component or deviation from normal operations that may indicate a condition exceeding design limits or result in a hazard(s) to persons, property, or the environment.
- MOVs (Motor-Operated Valves) are used to completely open or completely shut down the flow of product in a pipeline. MOVs are considered the on/off switch of the pipe.
Alarm Rationalization: Full Episode Transcript
Russel Treat: Welcome to the Pipeliners Podcast, episode 166, sponsored by the American Petroleum Institute, driving safety, environmental protection, and sustainability across the natural gas and oil industry through world-class standards and safety programs. Since its formation as a standards-setting organization in 1919, API has developed more than 700 standards to enhance industry operations worldwide. Find out more about API at api.org.
Announcer: The Pipeliners Podcast, where professionals, Bubba geeks, and industry insiders share their knowledge and experience about technology, projects, and pipeline operations. Now your host, Russel Treat.
Russel: Thanks for listening to the Pipeliners Podcast. I appreciate you taking the time.
To show the appreciation, we give away a customized YETI tumbler to one listener each episode. This week our winner is Michele Baker with Burns & McDonnell. Congratulations, Michele. Your YETI is on its way. To learn how you can win this signature prize pack, stick around for the end of the episode.
This week, Doug Rothenberg returns for the second of a two-part series on alarm management. We’re going to conclude our conversation talking about alarm rationalization. Doug, once again, welcome back to the Pipeliners Podcast.
Doug Rothenberg: Thank you. I appreciate it.
Russel: I asked you to come back on, and this time to talk about alarm rationalization, because I think alarm rationalization is often misunderstood. It can be a challenging and yet very rewarding process. Maybe the best way to start is what is rationalization and why is it important?
Doug: Rationalization, first of all, it’s a funny word. Let me change the word in our own mind. Alarm rationalization is, basically, a way to take a design for an alarm system and turn it into reality.
What does reality mean? Reality means what are the parameters for the SCADA system or the DCS to actually initiate the alarm?
What are the guidelines we provide for the controllers in the control room to actually respond to that alarm and all of the other parts of the infrastructure that wrap around that, including how to acknowledge the alarm, how to silence the alarm, and how to use the alarm response information? It’s that whole collective thing.
Russel: It’s basically everything about the alarm that a controller would need to understand it and respond to it.
Doug: The enterprise would need to implement it.
Russel: Manage it.
Russel: Exactly. It’s all the data about the alarm. I’m going to put this out here for you to tell me if you think I have this correct or not. I would say that the purpose of rationalization is to equip the organization to have alarms that they can respond to in order to mitigate possible adverse outcomes.
Doug: I would agree with that.
Russel: Oh, dadgummit…Not going to argue with me at all? That’s no fun.
Doug: Wait, we haven’t finished this podcast.
Russel: That’s right. I’m sure we’ll hit something.
There are several pieces of this that are easy to understand. There are some others that are more complicated. I think the easy things are capturing the adverse outcome and capturing what the response should be and what could cause it. That stuff’s relatively straight forward.
Alarm severity and getting that right is often misunderstood. In your opinion, what do people miss about establishing alarm priority?
Doug: The first misconception is that a lot of people confuse alarm priority with words and decoration. We won’t use the words like urgent or critical or moderate or low or medium or high. Those are terms which have a different meaning in all of our minds.
What is priority? Priority basically is the establishment of the level or the concern about any individual abnormal situation. What does it mean to the enterprise if this abnormal situation cannot be managed by the controller in time, either because he runs out of time, he misses it, he does the wrong things, or whatever? Once we encapsulate that risk, that informs priority. What do we do? How do we do that?
We foundationally start from — the pipeline’s operator has, as part of the infrastructure of risk management, how he wants to do risk management. What we do in alarm management is we take his marching orders and definitions and reframe it consistent with what he’s doing so that we can use it for alarm management.
Alarm management priority incorporates what is the exposure to the enterprise to physical loss? What’s the financial exposure? What’s the environmental exposure? What would be the loss of reputation if a certain incident were to happen and the public finds out about it?
We take all those risk-shaping factors and interpret them for an alarm system. The way we use it is we bring up each individual abnormal situation that we design an alarm for. Let’s take a low suction pressure for a compressor, let’s say. We want to protect the compressor for this. What is the risk if we don’t do it right?
We go and look at the alarm priority levels. We find the level that matches the situation that the compressor will see. That’s how we establish priority.
Russel: For people who have a background in risk management or have a background in safety, and particularly if they’re putting together the overall safety program for a company, they’re pretty comfortable with setting up these matrices of, “Well, this is how we’re going to score risk.”
This is a very similar thing. I think, in my understanding, or at least in my experience, better way to say it, when I’ve tried to take the alarm management risk matrix and line it up with the overall company safety risk matrix, that’s both helpful and a bit challenging all at the same time.
It’s helpful in that, if you can align it, it’s easier to get the company to buy into it, agree to it, understand it, and accept it. Oftentimes, at least in my experience, you got to tweak it a little bit to make it appropriate for what’s going on in the control room, versus overall, how the company is operating. What’s been your experience about getting this done?
Doug: Russel, this is a really good introduction into what I’m going to say next. In reality, when we start an alarm risk management assessment, we go to the safety guys or HSE guys, and say, “Hey, guys, give me your risk management definitions,” and things like that. They look, “Well, why do you want to do that?”
We tell them why. They give it to us. We work it into an alarm useful form, usually a matrix with different categories. We show it back to them. They say, “Wow, I didn’t realize that information could be so useful.”
Doug: A couple other times, they basically say, “I didn’t know that’s what it meant.”
Russel: I’ve never heard that exactly, Doug. I’ve had a couple of experiences where the comment was, “We never knew that you could use this for anything useful.” What they’re really saying is, “We never knew that this had direct application to operations at that level.”
Russel: It’s funny when you think about it.
Doug: It’s funny, and it’s also very satisfying. It provides a level of consistency between different parts of an organization that thought they might have been adversaries, but are now contributing partners.
Russel: The other thing I like about your work, in particular, is that if you set this up correctly, you do a very similar thing as you do in safety, where you come up with a matrix. The matrix has words in it. You put numbers underneath those words.
As you’re rationalizing an alarm, and you’re asking, “What could the adverse outcome be?” You’re saying, “Well, it could be this from a standpoint of safety, and this is from a standpoint of environment, this from a standpoint of reputation,” you pick those words. Underneath it, it’s calculating a number.
That number is being used to determine whether this is critical, or low, or whatever in the standpoint of severity. I find that extremely resourceful.
I also find it educational. As you begin to look at how your score is actually landed, you’ll find things like, “I have too many critical alarms. I have a bunch of critical alarms. I have no high alarms. Why is that?” That kind of thing.
Those are resourceful questions to ask from an overall automation and alarming program standpoint, about, “Well, no. Why is that? Is that real? Is that because of the nature of the process, or is it how we constructed our facilities? Is it something we got wrong in our risk matrix?”
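The word-to-number scoring Russel describes can be sketched in a few lines of Python. This is only an illustration: every category word, score, cutoff, and priority name below is a made-up assumption, not a value from Doug's book or from any operator's risk matrix.

```python
# Hypothetical consequence matrix. The team picks a word per category;
# the number behind the word is what gets added up.
SEVERITY_SCORES = {
    "safety":      {"none": 0, "minor": 2, "serious": 6, "major": 10},
    "environment": {"none": 0, "minor": 1, "serious": 4, "major": 8},
    "reputation":  {"none": 0, "minor": 1, "serious": 3, "major": 6},
}

# Hypothetical score cutoffs, checked from highest to lowest.
PRIORITY_CUTOFFS = [(18, "critical"), (10, "high"), (4, "medium"), (0, "low")]


def score_alarm(consequences):
    """Sum the numeric score behind each word picked for the alarm."""
    return sum(SEVERITY_SCORES[cat][word] for cat, word in consequences.items())


def priority_for(score):
    """Return the first priority whose cutoff the score reaches."""
    for cutoff, name in PRIORITY_CUTOFFS:
        if score >= cutoff:
            return name
    raise ValueError("score must be non-negative")


# Example: the words a team might pick for a low suction pressure alarm.
low_suction = {"safety": "minor", "environment": "none", "reputation": "minor"}
score = score_alarm(low_suction)
priority = priority_for(score)
print(score, priority)
```

Keeping the words and the numbers in one table makes it easy to see why a particular alarm landed where it did, which is the kind of question Russel raises about too many critical alarms.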
Doug: This is a very interesting topic. We have to go back to how the controller uses alarm priority. He doesn’t care what names we give them. Doesn’t care whether an alarm comes out at the highest level, the next highest level, the next highest level, or the lowest level.
What he cares is that when he gets an alarm of a priority level, and he gets another one of a different priority level, this tells him which order to work the alarms. This is fundamental.
If an enterprise ends up because they’ve used these definitions, and as you mentioned, converted the definitions into a numerical score, they scored an alarm, and they got the value. If it turns out they have too many highest priority alarms, we know the way the alarms should break out in order to be useful in priority.
What we will do is we’ll adjust the numerical parts of the alarm priorities, so we get alarms in the categories we’ve decided at a proportion that makes sense for risk management. I’ll give the proportions here. Nobody has to take notes.
The very highest priority, we only want a dozen or two alarms in that category. At that highest priority, if that situation is not managed, you’re betting the enterprise. They stand down. The next highest priority level, we think they should be about five percent of all the alarms should be in this category.
The next highest priority alarm, which is the second from the lowest, should be 15 percent of the alarms. 80 percent of the alarms should be in the lowest category. From low to high, we have 80 percent, 15 percent, 5 percent, and just a very few.
When they’re in this ratio, we know that when an alarm activates of a given priority, it accurately reflects the relative risk between priorities.
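A rough way to check a rationalized alarm list against the proportions Doug gives is sketched below. The 5, 15, and 80 percent targets and the "dozen or two" cap come from the conversation; the numbered levels, the tolerance, and the sample counts are assumptions for illustration.

```python
from collections import Counter

# Level 1 is the highest priority, level 4 the lowest.
TARGET_SHARES = {2: 0.05, 3: 0.15, 4: 0.80}
MAX_TOP_LEVEL = 24  # "a dozen or two" alarms at the very highest priority


def distribution_findings(levels, tolerance=0.05):
    """List the priority levels whose share of alarms is off target.

    `tolerance` is an assumed allowance, not a figure from the discussion.
    """
    counts = Counter(levels)
    total = len(levels)
    findings = []
    if counts[1] > MAX_TOP_LEVEL:
        findings.append(f"level 1: {counts[1]} alarms, expected at most {MAX_TOP_LEVEL}")
    for level, target in TARGET_SHARES.items():
        share = counts[level] / total
        if abs(share - target) > tolerance:
            findings.append(f"level {level}: {share:.0%} of alarms, target {target:.0%}")
    return findings


# Made-up example: far too many alarms ended up at the top level.
top_heavy = [1] * 400 + [2] * 100 + [3] * 300 + [4] * 1200
for finding in distribution_findings(top_heavy):
    print(finding)
```

A non-empty result is a prompt to revisit the numeric cutoffs in the scoring matrix, not a verdict on any single alarm.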
Russel: What you’re trying to do is you’re trying to set it up so that when the alarm comes in, the controller is going to work the highest severity alarm first.
Russel: What you’re doing is you’re saying if I have something going on, and it’s more than one thing, and the controller needs to make a decision which to work first, I’m taking that decision away. I’m going to presort the work. I’m going to work it by severity, which makes a lot of sense. It can provide a lot of value to the controller if that’s done well.
There’s a couple other things that in your book you talk about. I think it’s interesting. One is, you talk about taking the overall alarm score and adjusting it a bit, either based on the time I have available to respond. In other words, I’m going to make something that I have less time to respond to a little higher score than something I have more time to respond to.
I think that’s interesting and meaningful in this context. [laughs] It’s interesting, because I can talk in great detail and minutiae around this whole idea of priority. Bottom line is you’re trying to help the controller do their work more effectively.
Doug: Exactly. That’s the sole purpose. What does it mean when we pick the alarm of the highest priority first? In the previous podcast, we talked about some foundational principles for an alarm. One of them is you have to provide enough time for the controller to be able to do the job.
First of all, if two or three activate around the same time, you pretty much know that the controller can’t work all of them to completion. Which ones would you rather work first, knowing that the other ones won’t get a chance to be worked? The ones with the highest impact.
Right away, we satisfy our own gut feeling or intuition. If you’re going to take a hit, let’s take a hit for the lowest priority ones, because that’s less of a hit than the highest priority ones. That’s why we collect them in that order.
Russel: You were talking about a compressor station. It would not be uncommon that if I had an alarm at a compressor station, I might have more than one. Rolling those alarms together by a site has some value, because that reduces my alarm count that I’ve got to work. I’m only working one upset, even though I have multiple alarms.
The other part about that is I want to focus on the thing that has the most severity associated with it. If it’s just a matter of, I’m going to lose a little production, that’s different than I’m going to lose a compressor set, is different than I’m going to have an environmental release.
Russel: I want to deal with that thing that has the most impact first. That’s the whole point of the whole risk management. What I’ve seen a number of people do is get very bound up in trying to figure out how to do that. Your approach to taking a risk management and creating a frame and some scores simplifies and streamlines that conversation.
Doug: Not only does it do that, but it makes everybody feel comfortable, because everybody’s dealing out of the same card deck, and they’re using the same risk management protocols that both society feels they should be doing and the enterprise does. It changes the whole accountability from guesswork to engineering and careful work.
Russel: One of the things I want to talk about is this. The thing about doing this rationalization is there is a very long piece of rope that is the entirety of all the things I could put into the rationalization. How do I know what to rationalize, and how do I know how far to go?
Doug: What Russel’s really asking is that if I have a pipeline operation and I’ve got a controller sitting there, and I’ve got an engineer looking over your shoulder and the engineer says to the controller, “You know what? I can alarm almost anything. I can do it.”
What he really means is, “I can have maybe 15,000 configured alarms for you. You have a flow? I can do a high-high, a high, a low, a low-low, a rate of change. I can do all these things for you.” Alarm management basically has to say, “Okay. Slow down. Let’s take this capability of alarming everything and turn it into a tool.”
The way we do that is we don’t look at alarms that we could configure. We basically start with, “What are the ways this pipeline system can get into trouble?” Russel, I know you want to talk about abnormal operating conditions.
You want to pipe in with that and then I’ll pick you up afterwards and finish my discussion?
Russel: Yeah. This is always a challenge because I like to start an alarm rationalization process by getting an inventory of all the AOCs that could occur, and then asking the question, how would I alarm for this specific AOC? In pipelining in particular, almost everything is shown with some kind of pressure change.
The challenge is knowing what’s the root cause of the pressure change. What do I want? Do I want a high-pressure alarm? Do I want an alarm that says, “Oh, I have this blockage in the pipeline”?
What I really want is I want to know that I’ve got a blockage in the pipeline. I don’t necessarily want to know that I have a high pressure. It’s just that, that’s the way I indicate it. That conversation can get real circular real fast if you’re not careful. I do think it’s more valuable.
What happens if you start looking at the AOCs, much like when you’re doing a process hazard analysis, what you start finding is, “Oh, what I really need is an alarm that does this thing that I don’t currently have available to me as an alarm.”
Doug: Let me try to converge on Russel’s points because they’re all valid. The way to manage the situation is, one, have the people in the room who understand the problem, the people in the room who have to manage the problem after you understand it. They discuss and reach a consensus on what’s the best way to tell the controller he has a problem.
Once we tell him he has a problem, what’s the best way for him to get himself out of that problem? That’s a negotiation process. That’s an expertise process. That’s exactly a cornerstone of what we do when we do an alarm response activity information.
Russel: How do you know where to set the activation point? How do you determine the set point, the value at which I want the pressure to annunciate this is a high pressure?
Doug: The principle when we want a timer is wherever we set it, we have to give the controller enough time to be able to do the things we think we know he needs to do.
We’re going to document that, after letting me have enough time to do that. It’s easy to say because you know, time, but alarm systems don’t work in time. They work in engineering units, values.
If you have a high-pressure situation you want to protect, you have to find a pressure value that translates into time. You have to make sure that’s enough time for the controller to manage it. How do you convert?
We as engineers and designers understand a principle of process variables and conditions. They have a normal way they respond. We’re not talking about catastrophic failures here, because with catastrophic failures, it’s over and done with, and the operator or the controller can’t really manage them. We’re not talking about disaster management.
In the control room, how fast can this pressure change when things are going wrong, but they’re going wrong in an unreasonable way? You saw it may change in about maybe 10 pounds every half minute. It may change, in terms of flow, half a barrel if it’s liquid or flow. You know how fast these things are changing.
We have to say, how hard the job is for the controller to do the things we need him to do? We’ve already identified what he needs to do. He needs a checklist, look at that, make this, change this value. How long does that take in time?
Maybe it takes a minute or two for him to be able to do that, or three minutes to be able to do that. We know how fast the process is changing. We know how much time he needs. You multiply them together and you get an answer of, “Well, this is going to be 20 pounds.”
You set the alarm point 20 pounds away from the trouble point, which is a point you’re trying to protect. You have a reasonable confidence that in that difference of 20 pounds, there’s enough time for the controller to do all he needs to do to be able to manage it, if managing is in the cards.
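Doug's arithmetic (rate of change multiplied by the time the controller needs) can be written out as a short Python sketch. The 10 pounds per half minute and the 20-pound result come from the conversation; the 1,000 psi trouble point and the function names are hypothetical.

```python
def alarm_offset(rate_per_minute, response_minutes):
    """Distance from the trouble point, in engineering units, that buys the response time."""
    return rate_per_minute * response_minutes


def high_alarm_setpoint(trouble_point, rate_per_minute, response_minutes):
    """Setpoint for a rising variable: trouble point minus the offset."""
    return trouble_point - alarm_offset(rate_per_minute, response_minutes)


rate = 10 / 0.5  # 10 pounds every half minute is 20 psi per minute
offset = alarm_offset(rate, response_minutes=1.0)
setpoint = high_alarm_setpoint(trouble_point=1000.0, rate_per_minute=rate, response_minutes=1.0)
print(offset, setpoint)
```

With a one-minute response the offset works out to the 20 pounds Doug mentions; a two- or three-minute response at the same rate would need 40 or 60 pounds.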
Russel: Where this gets complicated is if I’m looking at pipeline facilities where I have to send somebody to manage the response and that sending somebody’s going to take 30 minutes or 45 minutes for somebody to get there.
It starts getting really challenging trying to determine, “This is going to move this fast. I’ve got 30 minutes.” Now, you start getting into, is it even operable at that level?
It does point out…I’m sorry, Doug. It does point out that this calculation of time is critical from an alarm management and rationalization standpoint.
Doug: Correct. What it also does is it informs that situation what you just said. If you need to send somebody out and this is a half an hour or an hour away or even longer, in that period of time, you can’t reasonably expect the process to stay manageable. You already know that you can’t solve this problem with an alarm.
You have to solve it with some other way, maybe some MOVs on the field, some shut down, some ways of automatically reducing pressure flow a little bit until you get that chance to get out there and do something more extensive.
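The same calculation can flag cases where an alarm alone cannot work. The sketch below assumes one simple test: the setpoint needed to buy the response time must still sit above the normal operating value. That criterion and all of the numbers are illustrative assumptions, not something stated in the conversation.

```python
def alarm_is_workable(trouble_point, normal_operating_value, rate_per_minute, response_minutes):
    """True if the required setpoint for a rising variable stays above normal operation."""
    required_setpoint = trouble_point - rate_per_minute * response_minutes
    return required_setpoint > normal_operating_value


# Control room response of about one minute: the setpoint would be 980 psi.
print(alarm_is_workable(1000.0, 900.0, 20.0, 1.0))

# Field response of about 30 minutes: the setpoint would have to be 400 psi,
# far below normal operation, so automation is needed instead of an alarm.
print(alarm_is_workable(1000.0, 900.0, 20.0, 30.0))
```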
Russel: Right. The other thing is managing an alarm is taking a process to a safe state, right?
Doug: It could be. It could be just restoring some abnormality that an adjustment will solve and you can keep running.
Russel: That’s right, too. I want to transition and talk a little bit about what you think the world needs in the next generation of alarm management systems. What do you think we need in alarm management?
Doug: What Russel just asked is that if we have an enterprise that’s done a really good job of what we now understand as proper alarm management. We’ve gone through, we’ve identified all the abnormal conditions. We’ve identified the best way to alarm them. We found the alarms. We’ve decided on the criticality of the alarm and the alarm activation point.
We’ve decided all the things the controller needs to do. We’ve got that all in place and working very well. What’s the next step to allow the controller to do an even better job of what we’ve asked him to do?
There are a couple of things. These are blue sky things. I’m not looking at it from the point of a SCADA implementer, like Russel is, or DCS. I’m looking at it from the point of view of a controller.
First one on my list would be, let me make sure that the alarm response information gets displayed to the controller right there and now. If we’ve already figured out what he needs to do, let’s show it to him in an automated fashion.
Russel: I’ve got to insert that we’ve been doing that in our solution for 10 years now.
Doug: It’s one of the, I wouldn’t say easy, but it’s one of the responsible things to do right out of the box. I’m going to extend a little bit further. In the alarm response sheet, there are a lot of things like check this, look at that, and make sure this, to the extent to which the SCADA system or DCS can actually do that checking, like check for the level in something.
We should do that automatically. We shouldn’t make the controller go and find that manually. If we’ve got sensors, we’ve got measurements, if it’s important to check it, let’s have the alarm response set up to do that part of the checking.
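One way to picture an alarm response sheet that does part of its own checking is sketched below in Python. The step wording, tag names, and threshold are hypothetical, and the dictionary of tag values stands in for whatever live data a SCADA system or DCS would supply.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ResponseStep:
    instruction: str
    # Optional automatic check: takes current tag values, returns True if the check passes.
    auto_check: Optional[Callable[[dict], bool]] = None


def render_response_sheet(steps, tag_values):
    """Run every automatable check and label each step for the controller."""
    lines = []
    for step in steps:
        if step.auto_check is None:
            status = "MANUAL"
        else:
            status = "OK" if step.auto_check(tag_values) else "ATTENTION"
        lines.append(f"[{status}] {step.instruction}")
    return lines


steps = [
    ResponseStep("Check suction scrubber level is below 60%",
                 lambda tags: tags["scrubber_level_pct"] < 60),
    ResponseStep("Confirm upstream valve is open",
                 lambda tags: tags["upstream_valve_open"]),
    ResponseStep("Call field technician if pressure keeps falling"),
]
tags = {"scrubber_level_pct": 72.0, "upstream_valve_open": True}
sheet = render_response_sheet(steps, tags)
for line in sheet:
    print(line)
```

Steps without a measurement behind them stay manual, so the sheet still shows the controller everything that was rationalized.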
The next one, remembering that time is a commodity for the controller. If you have enough time to do anything, everything is possible. Safe operations is really time management. What I would do is to take the alarm information on a SCADA system.
Every alarm displayed, I have a countdown clock which says, “Here’s how long it’s going to take until something bad happens.” Remember, we already have to know that information when we set the alarm activation point. We already have it, so let’s display it to the operator.
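A countdown like the one Doug describes could be estimated from the same quantities used to set the activation point. The sketch below assumes a rising variable, a fixed rate of change, and made-up numbers; a real display would need to handle falling variables and changing rates.

```python
def minutes_until_trouble(current_value, trouble_point, rate_per_minute):
    """Estimated minutes before a rising variable reaches the trouble point."""
    if rate_per_minute <= 0:
        return None  # not moving toward the trouble point
    return max(0.0, (trouble_point - current_value) / rate_per_minute)


def format_countdown(minutes):
    """Render minutes as MM:SS for display next to the alarm."""
    total_seconds = int(round(minutes * 60))
    return f"{total_seconds // 60:02d}:{total_seconds % 60:02d}"


remaining = minutes_until_trouble(current_value=985.0, trouble_point=1000.0, rate_per_minute=20.0)
print(format_countdown(remaining))
```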
Another thing is that when the alarms come up and the alarms change — let’s say the controller’s working on Alarm A, Alarm B comes in — we might want the alarm summary information to SCADA to automatically reorder the alarms for him, so he sees which alarm poses the most risk to the enterprise and not running out of time.
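Automatic reordering of the alarm summary can be sketched as a sort. The ordering rule here (priority first, then least time remaining) is an assumption about how risk and time might be combined, and the alarms are invented examples.

```python
from dataclasses import dataclass


@dataclass
class ActiveAlarm:
    tag: str
    priority: int             # 1 is the highest priority
    minutes_remaining: float  # estimated time before the trouble point


def work_order(alarms):
    """Highest priority first; within a priority, least time remaining first."""
    return sorted(alarms, key=lambda a: (a.priority, a.minutes_remaining))


alarms = [
    ActiveAlarm("A: station discharge pressure high", 3, 4.0),
    ActiveAlarm("B: compressor suction pressure low", 2, 6.0),
    ActiveAlarm("C: tank level high", 2, 1.5),
]
ordered = [alarm.tag[0] for alarm in work_order(alarms)]
print(ordered)
```

Re-running the sort whenever an alarm arrives or a countdown changes keeps the list in working order without the controller having to re-rank it.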
One of the big problems in pipeline operations is keeping track of all the equipment that’s not working up to snuff, or maybe out of service or reduced service. Usually, around those things there are problems that develop. Let’s keep track of all those kinds of things in a more organized, automated way.
Either we tell the controller, “Hey, this is not working so well. Watch it more carefully,” or we may want to inform the alarm system, “Hey, this is not working so well. Do you want to change alarm limits to make sure it’s not getting out of management?”
Russel: That starts getting a little more difficult, changing alarm limits. Generally, that’s under some kind of MOC [management of change] and control process. Doing that on the fly, that infers some complexity that is beyond what you can do and solve for.
Doug: Yes, but remember if you’re going to do it, it has to be pre-engineered. Pre-engineered also means pre-approved. That’s a good point.
Russel: Cool. Doug, I appreciate you coming on and talking about all these. What do you think that pipeliners ought to take away when it comes to rationalization?
Doug: I think that even though it looks hard to do, it looks like it might take a lot of time, there are very efficient ways of doing it. With the right amount of guidance, it can be done very expeditiously.
Everything you put into it, you get out in spades.
Russel: Again, I would absolutely agree with that. The thing that is often the most effective approach is get to something that’s workable and put in place the organizational support and culture so that you can be continually improving it.
The people who have the best ideas for improvement are the ones that are on the consoles working the alarms.
Doug: Exactly, and the team that closely supports them.
Russel: Exactly. Listen, this has been great, as always. Thanks for being on the podcast.
Doug: Thank you for asking me. Those of you who are listening, thank you for listening.
Russel: I hope you enjoyed this week’s episode of the Pipeliners Podcast and our conversation with Doug Rothenberg. Just a reminder before you go, you should register to win our customized Pipeliners Podcast YETI tumbler. Simply visit pipelinepodcastnetwork.com/win to enter yourself in the drawing.
Russel: If you have ideas, questions, or topics you’d be interested in, please let me know on the Contact Us page at pipelinepodcastnetwork.com or reach out to me on LinkedIn. Thanks for listening. I’ll talk to you next week.
Transcription by CastingWords