This week we hear from Fred Schenkelberg on his reliability journey, why Mean Time Between Failure is NOT the best metric to track, and what ARE the best KPIs for your operations.
Join the conversation!
Are you an industry leader in the fields of maintenance and reliability? We want to hear from you! If you would like to be featured as a guest on our podcast, please sign up here.
Stay tuned for more inspiring guests to come in future episodes!
00:00 Ryan: Welcome to Masterminds in Maintenance, a podcast for those with new ideas in maintenance. I’m your host, I’m Ryan, I’m the CEO and founder of UpKeep. Each week I’ll be meeting with a guest who has had an idea for how to shake things up in the maintenance and reliability industry. Sometimes the idea failed, sometimes it made their business more successful, and other times it revolutionized an entire industry. Today’s guest is Fred Schenkelberg, reliability…
00:26 Fred Schenkelberg: You have got it.
00:27 Ryan: [chuckle] I got it. Reliability, engineering, and management consultant extraordinaire. Welcome, Fred, to our show.
00:33 FS: Thanks, Ryan. Thanks for the invite and congratulations on launching a podcast series, that’s awesome.
00:39 Ryan: Of course, and we’re all extremely excited to have you on to today’s show. Maybe you could start us off by sharing a little bit more about yourself and how you got started in this kind of funny industry.
00:52 FS: Well, I have found in talking to lots and lots of people in the reliability world that my circumstance wasn’t unique. My boss walked in one day, I was working as an R&D engineer at a factory, we were making cable products, and he said, “We need to know if this will last for 20 years or not. Here, go figure it out.” I’m like, “I have no idea.”
01:21 FS: Yeah. [chuckle] I started my career there as a manufacturing engineer, and so I had the firm belief that it was always the design’s fault, the design team’s fault. If it doesn’t work, it’s a design problem. And it’s hard to argue with that. Now manufacturing can certainly make products worse. We have variability, but if you design for it, that’s a good thing. So, anyway, that’s how I got started. My boss said, “Go figure this out.” So I didn’t even know what reliability was. I still don’t know how to spell it nine times out of 10, and I got thrown into the deep end of the pool, and got a nice technical paper out of the study we did and learned a lot. It still gets cited, which is kind of exciting, and found that I really enjoyed the statistics and the types of questions I got to answer, and that just led to one opportunity after another.
02:11 FS: And then about 15 years ago I went into consulting, from when I was at Hewlett-Packard I switched to consulting, and never missed a beat. I’ve had a blast and worked with hundreds and hundreds of companies and all around the world, and just it’s been fun. And there’s never a lack of things failing, so it’s a never-ending job.
02:33 Ryan: Well, awesome. You mentioned it really started from that one paper, could you share a little bit more about what that paper was, and why it still gets cited today?
02:43 FS: Well, we made a heating cable. I don’t know if you’re familiar with heat tape or heat trace; it’s a cable essentially that you strap onto a pipe, for example. Let’s say you’re at a nursing home and you have a hot water heater in the middle of the building, but you want hot water at the far end of the hallway at the same temperature. So you could re-circulate it, but that’s a pump and then more pipe, and it’s expensive, so we put a heating cable on it that replaces the heat loss so that at the end of the line it’s the same temperature. And so we made a cable that was self-regulating. It had just enough carbon black in the polymer, with these two electrodes going right through the middle of it, that if it was cold in one spot, like if you put an ice cube on it, that one inch would heat up. So it was a pretty cool product, and it was just material science that allowed us to make these things. But we didn’t know how they failed. We knew that they lost power, the ability to generate heat, over time, but not exactly how, and whether it would last 20 years, ’cause this new product was gonna be buried in concrete bridges in the Alps to melt snow, and they didn’t wanna tear it out and replace it every five years, obviously.
04:04 FS: So my boss said, “We need to know if this design will last for 20 years.” And what came out of that was that it’s not straight-line behavior. We could approximate it: it starts at 100% power and then slowly decays, and what we’d done in the past was just draw a straight line through that and say, “Here’s our projection to when it fails.” But we found from experience that those products always lasted much longer than we projected. So I did a longer, more accurate test, and we measured the curvature of that decay. Then we went into the lab and figured out that the outer surface of the polymer oxidized and created a barrier to more oxygen molecules getting deeper into the product. It’s like an oxide layer, like tin oxide, that protects the surface from further decay. And so I worked with a professor out of Toronto and, between the two of us, we came up with a technique to model that curvature. It was a degradation analysis, technically based on the Wiener diffusion process. And so we applied it to that model, and that paper on a novel way to do degradation analysis still gets cited.
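[Editor’s note] The idea Fred describes can be illustrated with a few lines of code. The numbers below are made up for illustration, not the original cable data: if the oxidation is diffusion-limited, power loss grows roughly with the square root of time, and a straight-line extrapolation of early measurements underestimates life, just as he says.

```python
import math

# Hypothetical early-life measurements (hours, % of rated power).
# Diffusion-limited oxidation suggests loss ~ sqrt(t), not linear.
data = [(100, 99.0), (400, 98.0), (900, 97.0), (1600, 96.0)]

threshold = 80.0  # illustrative failure criterion: below 80% of rated power

# Naive straight-line extrapolation through the first and last points.
(t0, p0), (t1, p1) = data[0], data[-1]
slope = (p1 - p0) / (t1 - t0)
linear_life = t0 + (threshold - p0) / slope

# Square-root (diffusion-law) fit: p = 100 - k*sqrt(t).
# Least squares on loss vs sqrt(t) gives k = sum(loss*sqrt(t)) / sum(t).
k = sum((100 - p) * math.sqrt(t) for t, p in data) / sum(t for t, _ in data)
sqrt_life = ((100 - threshold) / k) ** 2

print(f"linear projection  : fails at {linear_life:,.0f} h")
print(f"sqrt-law projection: fails at {sqrt_life:,.0f} h")
```

With these numbers the straight line predicts failure around 9,600 hours, while the curvature-aware model predicts 40,000 hours, which is why accounting for the decay shape matters.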
05:26 Ryan: Wow, that sounds extremely complex, but it also sounds like it had a massive impact on the business that you’re working for, which is amazing.
05:34 FS: Yeah, we did get the business. We did get the Italian… I never got to go to Italy to go see it installed or spend 20 years there to see if it worked.
05:43 Ryan: Oh, man.
05:44 FS: We did get the business, but it also dramatically changed how we tested our products, how we evaluated them, and how we understood how they failed. That first test took about six months to do all the analysis and data collection and everything else. And we shortened that after that point: because of that understanding, we could go only two or three months and get almost as good information, so it helped all the way around.
06:10 Ryan: Wow. Alright. I always ask this question, what does reliability really mean to you? Can you share with our audience what reliability engineering is?
06:23 FS: There’s two answers to that. One of them is what it means to me. And I have a short story for you for that. The second part is what we do in reliability engineering is enable the rest of the organization to make better decisions. That’s what we do. We help them understand what will fail and when will it fail. But realizing that as you design and create a system to maintain a product, there’s a lot of decisions that are made. If you’ve got the right information, good information available to you and know how to understand it, analyze it, you will make the right decision. “Do I need a calendar-based program, or do I do it by inspection, or what’s the right way to set up my program?” Or on the design side: “Do I go with vendor A or vendor B? Is it ready to launch or not?” Part of that decision, it’s not just, “We gotta do something.” Oh, yeah, reliability is important, but we don’t know how to include that in the decision. So what reliability engineers do is help understand what information’s needed to make the right decision, and then either help or get that information to them to make that work. So that’s one way I understand reliability engineering.
07:39 FS: We also get to break things, which is way cool. That’s the sort of thing I like. The other part of it is that it was a Friday afternoon, and I was working at Hewlett-Packard in a corporate group, and I got a phone call from a sales engineer. I almost never got calls from sales engineers from our own company trying to sell me anything, but she called and said, “Tomorrow, Saturday morning, I have to go into one of our major customers that received the first shipment of 10 laptops, and seven of them were broken. Broken screens, dead on arrival.” And I said, “Well, I’m not the laptop guy.” She said, “Well, we’re working on that. But I have to go in there and explain why they should not cancel the rest of the order. What is it about HP’s culture and reliability attitude that’s gonna help us feel confident that the rest of these products are gonna be good?” So we talked for about a half hour, going back and forth about different historical things and internal training and all the different things we do, which I was pretty involved with across the corporation, to make sure that reliability is built in and designed into our products at every stage of the way.
09:06 FS: And so she was feeling good about what she was gonna say, but then she stopped and said, “You know, I don’t think I could do this anymore.” And I’m like, “Excuse me, what do you mean?” And she goes, “When I go sell this product, it’s based on… A large part is based on that it’s reliable, that it’s durable, it will do what it says it’s going to do. That we assembled it correctly, we packaged it correctly, we designed it correctly, and it’ll meet or beat your expectations. And that’s not happening. This is not the only time I’ve been disappointed by a shipment to our customer, and I’m their face, I’m the face of this thing.” And then she says, “I’m not willing to jeopardize my integrity by saying it’s gonna be reliable and it’s not.” And so I never did find out whether she stayed in the company or left or whatever, but that sums up what reliability means to me, is that there’s a person on the other end of our processes and our techniques and all of the work we do to get products into the market, and it’s gotta work, it’s gotta do what it’s supposed to do.
10:17 FS: And if there’s an error on the line or we got a bad batch, or the worst is you let a bunch of engineers on the floor to make products [chuckle], it becomes just a bunch of experiments, at the end of the day, somebody’s gonna open that box and turn it on, and if it works, it’s great, if it doesn’t work, you’ve disappointed somebody with a set of expectations being dashed. And so the reliability is very very personal. Every individual interaction with our products reflects on the entire organization, and so reliability is a big part of that.
10:55 Ryan: Absolutely, absolutely. It seems like… Fred, you’ve dedicated your entire career to reliability. It means a lot to you.
11:03 FS: Yeah, yeah. And it pays well too, so it’s not a bad deal. [laughter]
11:07 Ryan: And it seems like you’ve not only dedicated your career, but you’ve built a company based on reliability engineering. And I gotta ask the question: working as a consultant now, what’s the most common reason, what’s the trigger point, for people to reach out to you guys and say, “Hey, we would love to have a reliability consultant come to us”? What typically happens? What do people realize right before they realize they need to call you guys?
11:38 FS: Well, unfortunately, about a third of my clients call when they’re heading towards bankruptcy, because their product is just a disaster in the field, and they go, “Oops, we should have paid attention to that.” Yeah, and some of those companies make it and others don’t. They get this realization that they made some bad assumptions, or, yeah, reliability’s there but it’s too expensive or it’ll take too long, and so they just say, “Well, we’ll do our best,” and they don’t do due diligence as a minimum, and then they get surprised that it doesn’t work. About a third of my clients are in that boat. About a third are folks that are looking at their program, their overall system of how they make those decisions, what information they have, what tools they’re using, and they recognize that they could do better. They say, “You know, we keep doing what we’re doing and we keep getting this result, and we need to improve our reliability. What do we need to change?” And so I get called in to do an assessment, essentially, and find out how an organization understands and makes decisions concerning reliability, what’s working and what’s not working, and so on.
12:50 FS: So about a third of my clients are at some stage in their programs, but have a realization they should be doing better. It’s probably just an opportunity; they recognize an opportunity there. And then about a third of my clients are, “Can you teach us about FMEA?” Or, “Could you do this data analysis?” Or, “Could you do this specific task?” And so they’re very discrete little projects, and I’ll go do stuff like that. The hardest ones are the ones that shipped their product and half of them failed in three weeks, which is a true story for one client I had. They had enough money to try again and they were able to launch again, and they survived for about a year or so, but their brand identity was just gone, so they never recovered from that. And they made a very, very basic mistake. Do you remember Fitbit? I think they still sell Fitbits. This was a competitor of theirs, and they said, “Well, it’s on your wrist and you gotta be able to use it anywhere and wear it all day long, and you gotta wash your hands, take a shower, go swimming, it should be alright. How do we test that?” How would you test that? If you’re creating a little wristband thing, what would you do to figure out if it survived in the water?
14:08 Ryan: You jump in.
14:09 FS: That’s right. Well, they went one step further, they went to the Swiss standards for watches, ’cause watches come with that water-resistant, 50-foot dive watch rating, things like that. And so they found a... I think it was a 50-meter resistance test, and they put it in that deep of water for 30 seconds, pulled it out, shut the water off, and it was still working. “It’s good.” What they didn’t notice is that a week later, all of those samples were dead, ’cause it took about a week for the water to create the corrosion which shorted it out and killed it. The watch standard is, if there’s any water inside of it, it’s a failure. But they had no way of opening their product to see, because it was molded. But water still gets in. And so they missed that key element: corrosion takes time to occur, and that’s what water causes most often with a product. And, yeah, they shipped I think 70,000 units, and 50,000 failed in two weeks.
15:15 Ryan: Wow. Talk about impact on the business when you overlook reliability. That’s huge.
15:25 FS: It is, but you realize that there are time-to-market pressures, we gotta get out to the holiday events. I’ve got three new inventions to do today, I gotta set up a supply chain, I gotta set up a manufacturing line, I gotta set up advertising. There’s a lot to do when launching a product. And reliability isn’t immediately visible like all those other things are. You either know you’ve got the right packaging in-house, or not. You can see it. But you don’t really know if you’ve got a reliable design until it really gets to your customer. There are things we can do to poke into that and understand it. But there’s a lot going on, and it’s just one more thing on the long, long list of things that we have to do. Unfortunately, if it’s not designed and manufactured well and it fails, there’s not a lot of recovery from that. And so, unfortunately, those are the ones that I know I can help if they have the ability in the market to recover.
16:27 Ryan: Alright. Well, thanks for that, Fred. You talk a lot about mean time between failure. Tell us a little bit more about what that means to you, and why it is or isn’t important to track.
16:41 FS: Please never use MTBF. And I haven’t opened up your product to see if that’s central to it or not, but if it is, please change that. Do not make that the default. Mean Time Between Failure is an average, it’s the average time to failure for something. If it’s a repairable system, it’s the average time between failures. The unit itself, MTBF, is actually an inverse failure rate, it’s not a duration. So if somebody says 50,000 hours MTBF, that does not mean it’s gonna survive for five years without failure. That’s a very common misunderstanding of what that metric means. It’s not a failure-free period. It’s a one in 50,000 chance every hour that it will fail. That’s what it means. Most people don’t understand that, and then you bring statistics into it and their eyes glaze over and you lose everybody. It’s an average. Depending on what the actual time-to-failure distribution is, the average could be very, very short or very, very long, but it doesn’t tell you what the spread is, what the standard deviation is, it’s just a single number. It smooths out stuff, so it basically obscures the exact information you’re looking for.
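[Editor’s note] Fred’s 50,000-hour example can be checked in a few lines. Under the constant-failure-rate (exponential) model that a bare MTBF number implies, reliability over time is exp(-t/MTBF), so even one full MTBF interval is survived only about 37% of the time:

```python
import math

mtbf_hours = 50_000            # vendor-quoted MTBF from the example
failure_rate = 1 / mtbf_hours  # constant hazard the exponential model assumes

def reliability(t_hours: float) -> float:
    """Probability of surviving t hours with no failure (exponential model)."""
    return math.exp(-failure_rate * t_hours)

five_years = 5 * 8760  # 43,800 hours of continuous operation
print(f"P(survive 5 years)  = {reliability(five_years):.1%}")   # ~41.6%
print(f"P(survive one MTBF) = {reliability(mtbf_hours):.1%}")   # ~36.8%
```

So a 50,000-hour MTBF product, run continuously, fails before the five-year mark more often than not, which is exactly the misunderstanding Fred describes.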
18:05 FS: Say I have an increasing failure rate or a decreasing failure rate. That’s gone with MTBF. I could go on and on. I actually have a whole website devoted to it. It’s called No MTBF, nomtbf.com. And I started this concept when I was doing a presentation at a conference. I was getting introduced and somebody had just asked me a question: “I have a customer that wants me to give them an MTBF report.” And he was like, “What is that? Why would anybody want that thing?” So I got on stage and I asked the audience, “How many of you have run into people that are confused about what MTBF is? What does it really mean and how do you use it?” Everybody in the room, and these are all reliability folks and quality folks, they all raised their hand. So I threw away my presentation for the next hour and a half, and we talked about MTBF and how many ways it can destroy your business. It is just an exceedingly poor and misunderstood metric that unfortunately is used as a surrogate for reliability. And so I write about it a lot, because nobody else is saying, “Hey, don’t use this. This is misleading. This is misunderstood. This is not the information that you actually wanna get.” That kind of thing.
19:33 Ryan: Absolutely. And that’s very interesting that you bring this point up. One thing we always talk about here at UpKeep is that it’s really important to track metrics, to track a metric and aim to be better than you were yesterday. Knowing that MTBF might not be the best metric to track, what should we start tracking? What are good metrics to track as a maintenance and reliability team?
19:58 FS: For a maintenance team, it’s uptime. Right? Availability. But not MTBF over MTBF plus MTTR. Those are averages that obscure what you’re really looking for. If you have an eight-hour shift where you’re running equipment, or you have a run that goes for a week or six weeks, your ops manager wants to know: what’s the probability that if we turn this equipment on and start the shift or start the run, it’ll complete it? How many hours of our schedule are we gonna actually be producing products? What’s our availability? And that’s true availability. And in today’s factories, you can measure it directly. You don’t have to estimate it or take samples or anything else.
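[Editor’s note] A minimal sketch of measuring availability directly from logged run and downtime hours (the event log below is hypothetical), rather than from the MTBF/(MTBF + MTTR) averages Fred warns about:

```python
# Hypothetical event log for one week (168 h) of production.
# Measuring availability straight from the log avoids the averaging
# that MTBF / (MTBF + MTTR) bakes in.
events = [
    ("run", 38.0), ("down", 2.5),   # unplanned stop and repair
    ("run", 51.0), ("down", 1.0),
    ("run", 70.5), ("down", 5.0),
]

run_time = sum(h for kind, h in events if kind == "run")
down_time = sum(h for kind, h in events if kind == "down")
availability = run_time / (run_time + down_time)

print(f"run {run_time} h, down {down_time} h")
print(f"measured availability = {availability:.1%}")
```

With these numbers the measured availability is about 94.9%, and because it comes from the raw log, you can also slice it per shift or per run to see the patterns a single average would hide.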
20:43 FS: Oftentimes that’s the most important one. Now, of course, reliability and time to repair come into play in that. But I’m still interested… At this one factory I worked with, we were canning soup, a big famous soup maker. And what a loud factory, tin cans all over the place. And one of the things they were interested in, and this was like a $10 million project, was: “Do we add redundancy or not, in order to improve our availability and our ability to quickly switch the line over from one product to another product?” And what they found is that when they switched sizes of cans or sizes of bottles, for example, they had a very, very high failure rate until they got it all tweaked in and tuned in and everything else. And they were working on their change-over methodology also. But they were looking at it as: if we can get to that lower failure rate quicker, this would make sense, we’ll spend the $10 million and adjust our lines. If we can’t do that, then it’s not worth doing, and we’ll work on other methods to improve our change-overs. And it was a decreasing failure rate, but the only metric they used was MTBF.
22:01 FS: And so if the run went for a week, you got a great MTBF. If the run only went for six hours, it was horrible. And that’s not helping you. You need to see the pattern that’s occurring. Every single shift showed this decreasing failure rate. And so we did a lot more modeling of that changing failure rate and then found that sweet spot of how and what, balanced with what their investment was, in order to get those change-overs to an appropriate failure rate quicker. But if they ignored the changing failure rate, or used the surrogate, which masks that information, it would have taken them much, much longer to find out what to do and how to move forward with it. So that’s part of why I really don’t like that metric.
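[Editor’s note] The changeover story can be sketched with a Weibull model, where a shape parameter below 1 gives exactly the decreasing failure rate Fred describes, and collapsing it to a single MTBF hides that trend. The parameter values here are illustrative, not from the soup factory:

```python
import math

# Hypothetical Weibull model of post-changeover stops: shape < 1
# means a decreasing failure rate (early trouble that settles out).
beta, eta = 0.5, 100.0  # shape, scale in hours -- illustrative only

def hazard(t: float) -> float:
    """Instantaneous failure rate at t hours for Weibull(beta, eta)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

# The single number MTBF collapses this to: eta * Gamma(1 + 1/beta).
mtbf = eta * math.gamma(1 + 1 / beta)

print(f"failure rate 1 h after changeover : {hazard(1.0):.4f} /h")
print(f"failure rate 25 h after changeover: {hazard(25.0):.4f} /h")
print(f"MTBF = {mtbf:.0f} h (one number; the improving trend is gone)")
```

Here the failure rate drops fivefold between hour 1 and hour 25 of a run, yet the MTBF is one flat number, so a short run and a long run look incomparably different even though the underlying process is the same.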
22:51 Ryan: Absolutely. Yeah, kind of pivoting a little bit, where do you see the future of reliability going? What do you [22:57] ____?
22:58 FS: Yeah, nobody will be talking about MTBF anymore. That’s one future I’d like to see.
23:02 FS: On the Speaking of Reliability podcast we were talking about this this morning. One of the things we’re seeing is that more and more folks, and not just reliability people, but ops managers, design teams, equipment manufacturers, suppliers, everything else, are spending less and less time truly understanding failures and failure mechanisms. Now, some are good, there’s no doubt, there are still some organizations that really go all the way and fully understand things, but we’re finding more and more people are not doing that. So when I bring a failed bolt into the shop and say, “What could have caused this?” there’s kind of a blank stare, and it’s, “Did you replace it?” [chuckle] As opposed to asking, “Was it the appropriately designed bolt? Are we using the right standard, or the right hardness, or the right diameter,” or whatever it is? “Was this a defective bolt? Was there something wrong in the metal itself? Let’s do a study of the metal. What kind of fracture was it? What can we learn from that?” And you see it every now and then, people share their failure analysis and techniques for doing that kind of work.
24:22 FS: Yet, I see that as declining. So, in order for us to get better at what we’re doing, more and more people have to become better versed, I think my English teacher is cringing, at understanding failure mechanisms. How do things corrode? How do things deform? How do things overheat due to electrical current changes, or resistance changes. Reliability gets involved with material science, electrical engineering, mechanics, all kinds of cool stuff. And even more so now with more and more robotic equipment, we get involved with software. So it’s understanding how those things fail, either from a design decision that was poorly constructed or made, to material variation, to changes in the environment, that changed the stresses that are applied, all the way through to, how does it manifest as a failure? And I see more and more need for that. And so that’s one place I really hope things will go to. And it’s not that we’ve never done it, we’ve got way better tools than we ever had to understand failures, so I think we can get better at it.
25:36 FS: Coupled with that is, with that understanding, then we can use the internet of things. We can use the sensors on the equipment, we can monitor vibration live and look for misalignments as they occur. All of those things are only possible if you understand the failure mechanisms. Somebody comes in and goes, “Oh, I’m gonna do internet of things,” and they put sensors all over their factory and they get terabytes of data, but there’s no understanding there, they don’t know what to do with it. So it has to come back to truly understanding how things fail, down to a theoretical level. Then put your sensors in, then take proactive action to prevent failures. But I see there’s gonna be a dramatic need for that level of understanding in the near future.
26:28 Ryan: So what I’m hearing is, it really starts with the understanding of why. Where do you go to learn, where should others go, and our listeners go, to learn more about understanding failure mechanisms, and the 30 million different ways a bolt can fail and why?
26:45 FS: [chuckle] Yeah. 30 million and six I think it was last time I counted it.
26:48 FS: It’s a lot of ways. It’s a never-ending process. Start with the basics, start with what’s failing now, and go talk with your senior folks that understand how it works. Go talk to the chemists and the material scientists. That very first project I worked on, I was in the chemistry lab half the time trying to understand this oxidation process and why it didn’t go all the way through the material, which was what was really happening. Part of what makes a good reliability engineer, I think, is that you’re just curious about how things fail. Now, the downside is that sometimes it makes it difficult to get on an airplane, especially when you see the cracks on the wings and the pilot says, “Well, they’re not that big yet.” Like, “Okay.”
27:37 FS: But where do I go? I read, and I go to Amazon. That’s one place. It’s a quirk, I guess: I go to Amazon and I search for books on reliability engineering or reliability in general, and then I sort them by price, lowest price first. And there are a bunch of books, real classics, that are available for thousands of dollars, which I don’t get, but there’s a whole pile of them for under 5 cents, or for 20 cents, and then they make money on shipping, I guess, I don’t know. It costs a dollar to ship a book. And so I go get five or six of those every month. And some of them are old, and some of them are dated, but some of them have little gems and insights, and so on. Now, if I was really serious about learning about a particular failure mechanism, I’d go to the library. It’s all online now. Go back to your alma mater and get library privileges.
28:32 FS: I teach at the University of Maryland, and so I get faculty access to the library, and so I search and get the technical papers and the books that are the state-of-the-art understanding of that mechanism, and then use that to design tests, or to alter designs, or suggest improvements. I’m probably doing that every month, I’d say, on some failure mechanism or another. But to get started, there are some great textbooks out there that give you some of the basics. What is corrosion? What is drift? What is creep? Those kinds of things. And then move on from that to what’s relevant for your environment. Then just keep asking questions and keep looking. The library is a great resource, and just keep reading, and asking, and experimenting.
29:24 Ryan: Absolutely.
29:26 FS: Where I go is I talk to folks like you. And at Accendo Reliability we have, I think, close to 20 authors that have been active at one point or another. So there’s lots of literature, from engineering management to detailed failure mechanisms. And then clients are a rich source of how things fail. [chuckle] So it never really ends.
29:48 Ryan: Better from worst. Yeah.
29:52 FS: That’s right.
29:53 Ryan: So what I’m hearing is, it’s all about just being curious, it’s all about reading more. I could also see this idea of going into the many, many different ways a bolt can fail going down a very long, long path. How do you… Any tips, advice, recommendations on how to prioritize different failure mechanisms?
30:20 FS: Yeah, that’s a good question. Usually, for the vast majority of materials and equipment and items we use, like a motor, what’s the basic technology of that motor? A quick Google search will probably find a dozen articles that say, “Here’s the top five ways this can fail.” Once you rule those out, then you can go start looking more deeply. But if those cover 95% of the failures you’re ever gonna see, it prioritizes pretty quickly. Once you know the basics of… Bolts, for example, tend to fail from shear or metal fatigue: they get repeatedly loaded or bent and then relaxed, and that creates a brittle fracture, which appears completely different than a shear, a complete over-force. But the vast majority of them occur in one of those two ways. And so being able to recognize the pattern in the fracture surface gives you a pretty good clue what you’re dealing with. And then you can go make some more measurements, or do some more analysis of that structure, to figure out what you’re gonna do about it.
31:35 FS: Electrolytic capacitors fail through dielectric breakdown: the dielectric layer eventually erodes away, and they will all fail that way eventually. But if you don’t manufacture one correctly, or it’s not designed correctly, it’ll fail because hydrogen builds up inside of it, and that’s what that little cross-hatch on the top of electrolytic capacitors is for: so that they can vent when this gas builds up. It’s gotta go somewhere, so they make it easy for it to go in one direction. Unfortunately, it’s really corrosive and it kills everything else around it. But understanding just the fundamentals of any device, any component, any fastener, of how they can fail, is the starting point. And that’s just basic knowledge, I think, not just for reliability engineers, but for anybody dealing with the reliability and maintenance world.
32:32 Ryan: Absolutely. Fred, you are a wealth of knowledge. Where can our listeners find more ways to connect with you, read more of your materials, articles, and learn more about you and what you’re doing?
32:46 FS: Okay. Well, I mentioned accendoreliability.com. It’s a platform for folks like you to share your podcast and get the word out to an audience. It continues to grow. We’re bigger than any professional society now, in attendance and engagement. And I heard at a conference a couple of years ago that in one month Accendo Reliability puts up more content, practical useful content, than all of the professional societies that deal with reliability combined do in a year.
33:22 Ryan: Wow.
33:24 FS: That sounds pretty amazing, so I went and added it up, counting the conference papers, and we do that every month. There are a lot of contributors that are willing to share their information. So Accendo Reliability is the platform I’m building with the help of a lot of other people providing content, and a lot of my content is there too. Obviously, I’m on LinkedIn, and more than happy to connect, and that’s where I share a lot of the new material that comes out, other people’s work, and my own work. I answer questions every day for people. That’s where I get ideas for articles, and it keeps me in touch with the industry. And nomtbf.com talks about MTBF, if you’re really interested in diving into that. And, oh, reliability.fm, it’s a website, a podcast network. So there are a number of different ways to find my about page and contact info.
34:21 Ryan: Alright. You heard it from Fred. Awesome. Thank you so much again, Fred, for joining us.
34:26 FS: You’re welcome, Ryan. My pleasure.
34:29 Ryan: Absolutely, and thank you to our listeners for tuning in to today’s Masterminds in Maintenance. It was a great chat.