NASA: "Never Have Another Accident Due to Our Organizational Flaws"

The keynote at the IRPS reliability conference I attended was by Nancy Currie-Gregg. Today she works at the NASA Engineering and Safety Center, but she has been on four space shuttle missions to Hubble and the International Space Station, in 1993, 1995, 1998, and 2002. You might think that the semiconductor industry has little to learn from NASA, but as we get more involved in safety-critical applications like automotive, there are lots of areas where we can learn from them. Madame Experience runs a very expensive school, but fools will learn in no other.

Nancy told us how, on her first mission, she wrote a letter to her six-year-old daughter and sealed it in an envelope, to be opened only if she didn't return. By the fourth mission, the daughter was 16. Happily, four times she got the letter back and tore it up. She was on a flight that got bumped ahead of STS-107 because the Hubble servicing mission had priority. STS-107 was Columbia. Those astronauts didn't get to tear up their envelopes.

She gave a vivid description of the launch. At T minus six seconds, the engines come up to power and you are left in no doubt that you are not in a simulator this time. Seven million pounds of thrust kick in. By the time the shuttle clears the top of the launch tower, you are already at 100 mph. It is shake, rattle, and roll. Once the boosters come off, it is smooth, with gradual acceleration up to 3G. It takes just eight and a half minutes to get to orbit. In Houston, on a bad traffic day, you go maybe a mile and a half.

High-Reliability Organizations

She talked about High-Reliability Organizations (HROs), which constantly confront the unexpected but operate with remarkable consistency and effectiveness. Human spaceflight is inherently risky, and as a result NASA has to operate as an HRO. In the US/NASA human space program, there have been three fatal accidents:

January 27, 1967: Apollo 1 fire on the launchpad
January 28, 1986: Space shuttle Challenger during ascent
February 1, 2003: Space shuttle Columbia during re-entry

The statistics are somewhat brutal: 17 crew members lost in those three accidents, out of a total of about 350 crew members who have flown, roughly a 5% loss rate. So if you flew three or four times, as she did, the overall risk was high. The most recent accident, Columbia, she was involved with directly. The crew were her friends and colleagues, and she was deployed within 24 hours to help with the recovery.

One lesson that was learnt almost instantly was that the belief that 98% of the risk was in launch and ascent had been a false sense of security. An earlier lesson was that the US and Russian space programs did not learn from each other. In the Apollo 1 fire, rescue efforts were delayed because the hatch opened inward and could not be opened against the pressure from the fire. Unknown to the US, the Russians had had a fatal accident in 1961 in which rescue efforts were delayed for the same reason: the hatch opened inward and could not be opened. Had NASA known, they would already have put in place the rule that exists today: all hatches in a spacecraft open outward.

I will not go over the physical aspects of the Columbia accident since you probably know them already. Foam shed from the external tank struck the orbiter and exceeded its impact tolerance; as a result, superheated air penetrated the wing structure during re-entry, causing loss of control and the eventual breakup of the orbiter. Less than two pounds of insulation brought down the vehicle (although it did impact at 550 mph). The accident investigation board identified two other causal factors.
First, "the shuttle program does not consistently demonstrate the characteristics of organizations that effectively manage high risk." Second, "the accident was probably not a random event but rather rooted to some degree in NASA's history and the space flight program's culture." Organizational Culture Looking at that last item, cultural causes were: Reliance on past success as a substitute for sound engineering practice Organizational barriers that prevented effective communications of critical safety information and stifled differences of professional opinion Lack of integrated management Evolution of an informal chain of command and decision-making processes that operated outside the rules A quote from the board report: "Managers created huge barriers against dissenting opinions by stating preconceived conclusions based on subjective knowledge and experience, rather than solid data." If you have been involved in a large software or semiconductor design project, especially one that is off the rails, you have probably seen some of this. The engineers in the trenches know that aspects of the project are out of control, but the higher in management you go, the more everything seems to be on track and on schedule. Since missing the market window is not acceptable, and can be fatal to the company if it is small, senior management might decree that it will tape out on time, based on their gut feel back when they were managing a design project themselves and the chip worked. NASA calls the normalization of deviance, when catastrophe doesn't occur and so whatever is being done wrong becomes the norm. This is a gradual process where unacceptable practice becomes acceptable. Changing this is hard. In NASA's case, it requires changing the culture of thousands of civil service engineers and tens of thousands of contractors. Solving the technical problem is easy compared to increasing the clarity of risk. As semiconductor markets move from mobile, where failure is annoying but not life threatening, to automotive, where lives are on the line, then companies and design teams need to learn some of these lessons from organizations like NASA that has learnt them the hard way. The biggest challenges are not the technical ones but the culture of engineering management. There is a risk that what "worked" for chips up until now is assumed to be state of the art, and thus will "work" for these more life-critical designs that the industry is "driving" towards at full speed. NASA did have a safety organization, but it lacked resources, independence, and authority. Senior management had to accept that authority can and should de delegated, but accountability cannot be relinquished. It arises from responsibility. And as General Colin Powell said, "Being responsible sometimes means pissing people off." The key change that came out of the inquiry was the creation of a robust and independent program technical authority that has no connection to, or responsibility for, schedule or program cost. They needed to get away from the situation described by the board's chairman, Admiral Harold Gehman: The safety organization sits right beside the shuttle person making the decisions, but behind the safety organization there is nothing there: no people, money, engineering, expertise, analysis. The is no 'there' there. The National Engineering and Safety Center So the National Engineering and Safety Center, where Nancy works, was set up in July 2003, with people and money. 
The challenge is to keep it working, since history shows that a strong focus on safety comes easily right after a critical event, but maintaining that vigilance for years is what is needed to prevent future accidents. After creating it, they expected to get a lot of anonymous tips, but in fact one-third of the requests come from program managers. The organization has its own budget and expertise, and so can provide value-added independent assessment. The goal of the NESC is the quote I picked for the title of this post: "Ensure we never have another accident that is attributable to our organizational flaws."

Q & A

There were a couple of interesting questions afterwards.

Q: Will there be another manned mission? Will India, China, and Elon Musk beat NASA? What about Mars?

A: For going to Mars, the limiting factor is not the technical capability but knowledge of the effects on the human body, which we still need to learn about. Musk says he doesn't worry about this; he just worries about getting them there. NASA does worry about it.

Q: What options were available with Columbia if the damage had been identified?

A: Afterwards, a reworked risk assessment put the odds at 1 in 2. Had NASA known, weight could have been reduced and the trajectory altered. The vehicle was two to three seconds from the peak heating regime, so buying just a few more seconds might have made the difference. The orbiter could have stayed up for 16 to 17 days while things were assessed and decided.
