GAO reproaches NNSA for nuclear simulation supercomputer disaster recovery plans

Should the supercomputers used to model changes to the U.S. nuclear weapon stockpile fail due to a disaster, it's not clear whether the national laboratories in charge of them could easily recover their capability, says the Government Accountability Office.

The GAO, in a report dated Dec. 9, faults Lawrence Livermore, Los Alamos and Sandia national labs for not fully implementing contingency and disaster recovery plans for their classified supercomputers. It also chides the National Nuclear Security Administration for not providing sufficient oversight, for not tracking spending on supercomputer recovery efforts, and for not having estimates available of what a recovery might cost.

Since the United States stopped underground nuclear tests in 1992, it has relied on supercomputers to provide modeling and simulation data in lieu of live tests. As such, supercomputing capabilities at the three national labs are central to national security, but both the labs and NNSA consider the risk of a loss of system availability to be low. Only Los Alamos has conducted a formal business impact analysis to determine the impact of a loss, the report says, and the analysis lacked specificity.

Should one supercomputing site go down, another national lab might be able to take on the additional work, but barriers exist to actually doing that, auditors found.

For example, the Los Alamos lab projects that in 2011 it will have about 6.5 times the capacity, measured in teraflops, of Livermore and nearly 19 times the capacity of Sandia. As a result, there will be a significant centralization of supercomputing capacity in Los Alamos. Were Los Alamos supercomputers to suffer damage, neither Livermore nor Sandia would possess sufficient capacity to assume the full workload. It's also questionable whether the other two labs would be able to assume even the minimum computational workload required for emergency processing priorities, since none of the labs has calculated what that workload amounts to, the report says.
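
To put those ratios in perspective, here is a rough back-of-the-envelope sketch in Python. It is not from the GAO report: it assumes the ratios of 6.5 and 19 are exact (the report gives them as "about" and "nearly"), it normalizes Los Alamos capacity to 1.0 rather than using actual teraflop figures, and the function and variable names are purely illustrative.

```python
# Relative capacities, normalized so that Los Alamos = 1.0.
# Assumption: the approximate ratios from the report are treated as exact.
LOS_ALAMOS = 1.0
LIVERMORE = LOS_ALAMOS / 6.5   # Los Alamos has about 6.5x Livermore's capacity
SANDIA = LOS_ALAMOS / 19.0     # and nearly 19x Sandia's capacity


def capacity_shares():
    """Return each lab's share of the combined capacity of the three labs."""
    total = LOS_ALAMOS + LIVERMORE + SANDIA
    return {
        "Los Alamos": LOS_ALAMOS / total,
        "Livermore": LIVERMORE / total,
        "Sandia": SANDIA / total,
    }


def fallback_fraction():
    """Fraction of Los Alamos capacity that Livermore and Sandia hold together."""
    return (LIVERMORE + SANDIA) / LOS_ALAMOS


if __name__ == "__main__":
    for lab, share in capacity_shares().items():
        print(f"{lab}: {share:.1%} of combined capacity")
    print(f"Livermore + Sandia vs. Los Alamos: {fallback_fraction():.1%}")
```

Under these assumptions, Los Alamos would hold roughly 83 percent of the three labs' combined capacity, and Livermore and Sandia together would amount to only about a fifth of what Los Alamos has.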

All three labs have been backing up their data at a primary and an alternative site, but Los Alamos has been doing so at sites located less than a mile apart. Should one backup storage facility be threatened with destruction, it's possible the second site could be just as threatened, the report notes.

National labs were also unclear on who is responsible for oversight of contingency and disaster recovery planning. NNSA Advanced Simulation and Computing (ASC) personnel told the GAO that the chief information officer was responsible; CIO officials, in turn, told the GAO that the ASC program office had responsibility. ASC personnel, when confronted with what CIO officials had said, agreed that the CIO officials were right, says the report.

In fact, the office of the CIO was wrong, according to the GAO. The CIO should be responsible for disaster recovery, the GAO says, and NNSA said in its response to the report that it would undertake an effort to clarify the CIO's role.

But NNSA disagreed with the GAO recommendation to identify the minimum capacity needed to meet stockpile stewardship requirements, saying that different types of supercomputers serve different functions. The GAO argues that since national labs have identified the supercomputing processing needed for normal business operations, they could identify the minimum needed in the event of a service disruption.

NNSA also disagreed with the GAO's recommendations on tracking expenditures on contingency and disaster recovery, saying that breaking out those costs would not add significant value to managing contingency and disaster recovery.

For more:
 - download the report, GAO-11-67 (.pdf)
