Oak Ridge Labs Visit
September 16, 2013, 5:54 pm
I recently returned from a visit to Oak Ridge National Lab’s Computer Science and Math Division. I gave a talk there on “Building a Reliable System out of Unreliable Components”. A few observations from my trip:

  • It was nice to see a significant UNM presence at Oak Ridge. Barney Maccabe, a former professor at UNM, is the division director. James Horey and Manju Venkata are two former UNM PhD students who now work as researchers in the division. It’s always nice to visit a place where there are familiar faces.
  • It was great to talk to a large number of HPC researchers and get an update on the state of the art. I learned that silent faults, in which single bits are flipped, are now the most common faults, occurring at a rate of about 1 per hour. The trend in HPC of trying to reduce the power supplied to individual gates suggests that this rate will only increase.
  • I heard that there are many postdoc opportunities at ORNL, and not enough qualified applicants.  This mystifies me since the lab seems like a great place to work.  I certainly met many smart, productive people who seemed happy to work there.
  • ORNL is distinguished as the only national lab that allows deer and turkey hunting on the grounds.  I wonder when Los Alamos will follow its lead?

ICIS Future of the Field Workshop
September 5, 2012, 6:10 pm
A few weeks ago I gave a talk on network reliability at the ICIS “Future of the Field” workshop on High Performance Computing[1]. The talk surveyed the three main theoretical approaches we have for building reliable distributed systems from unreliable components. When writing the talk, I was struck by how few theoretical tools we actually have for this problem: Byzantine agreement and state replication; secure multiparty computation; and a circuit-based approach proposed by von Neumann, which seems to have been more or less abandoned 20 years ago. Am I missing anything?
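
To make the redundancy idea concrete, here is a minimal sketch of plain majority voting over replicated components. It is in the spirit of the circuit-based approach but is not von Neumann's actual construction. It assumes that each of n replicas fails independently with probability p and that the voter itself never fails; both are simplifying assumptions, and the function names are made up for illustration.

```python
from math import comb


def majority(bits):
    """Return the value (0 or 1) reported by a strict majority of the replicas."""
    if len(bits) % 2 == 0:
        raise ValueError("use an odd number of replicas to avoid ties")
    return int(sum(bits) > len(bits) // 2)


def majority_failure_probability(p, n=3):
    """Probability that more than half of n replicas fail.

    Assumes each replica fails independently with probability p and that
    the voter is perfectly reliable (an idealization).
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of replicas to avoid ties")
    k_min = n // 2 + 1  # smallest number of failures that corrupts the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    # One replica out of three has a flipped bit; the vote still recovers 1.
    print(majority([1, 0, 1]))
    # Compare a single component with three- and five-way voting.
    for n in (1, 3, 5):
        print(n, majority_failure_probability(0.01, n))
```

For n = 3 the sum reduces to 3p^2 - 2p^3, so voting helps only when p is below 1/2, and the gain shrinks once faults are correlated or the voter can fail too.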

