# Machinations

On Networks of Noisy Gates
September 12, 2011, 6:24 pm
Filed under: Uncategorized | Tags: , , ,

In 1952, Von Neumann proposed the following problem.  You are given a function f that can be computed with a circuit of n gates.  You have to build a new circuit to compute f, but all the gates in your new circuit are unreliable. In particular, each new gate fails independently with some small probability epsilon, and when a gate fails, it computes the NOT of what it should compute.  What is the smallest number of gates you need in order to compute f with a probability of error strictly less than 1/2?

Turns out that in general you need Theta(n log n) gates.  The upper bound was shown by Von Neumann, and the lower bound for the exclusive-or function was first proved by Pippenger, Stamoulis and Tsitsiklis (after a previous incorrect attempt by others).

Today I want to talk about the upper bound result.  You might think this is pretty easy – just duplicate each gate log n times, do majority filtering, and use Chernoff bounds to show that the majority output will be correct.  Unfortunately, duplication and majority filtering alone fails since the fraction of faulty wires will increase by a constant with each level of the circuit.

To fix this, you need a sampler (aka a concentrator).  A sampler is just a special type of bipartite graph.  In the simplest case, there are the same number of nodes on the top as on the bottom and the graph is d-regular for some constant d.  The very useful property of a sampler is that it reduces error in the following sense.  If some fraction x of the nodes in the top are “bad”, then the fraction of nodes on the bottom that have a majority of bad neighbors is less than x/4.  We can make the denominator an arbitrary large constant by increasing d, but 4 will suffice for our purposes.  For more about samplers, including how to construct them and other applications, see Section 3.2.1 of our Breaking the O(n^2) Bit Barrier paper.

Once, you’ve got a sampler, Von Neumann’s construction is pretty easy and is illustrated in the figure below.  In the upper right corner of the figure is a noiseless gate G with two input wires and a single output wire.  The large circuit to the left replicates this functionality with noisy gates.  In the large circuit, in place of the two input wires, we have two “cables”, each consisting of Theta(log n) wires.  In place of G, we have Theta(log n) replicas of G.  One wire from each cable goes into each of these replicas.  Below this, we have wires connected via a sampler to a bunch of simple circuits (“Maj”) that simply compute the majority of their inputs.

These majority circuits (Maj in the figure) are themselves also noisy.  However, since they have constant degree input, its not hard to construct them so that they fail only with constant probability.

The names “Executive Organ” and “Restoring Organ” used in the figure are Von Neumann’s original terminology.  A bit colorful perhaps, but who are we to change it?

The simple analysis of this circuit is illustrated below. Adjacent to each cable is now an upper bound on the fraction of wires in the cable whose output is faulty.  Each of these upper bounds holds with high probability (whp),  i.e. a probability of error that is polynomially small in n). Adjacent to each gate is the probability that the gate is faulty.

We assume inductively that the fraction of bad wires in each of the input cables is no more than \theta for some fixed constant \theta.  The first observation is that the fraction of faulty wires coming out of the gates G is then no more than 2\theta + 2\epsilon.  To see this, note that, whp, the fraction of gates that fails is no more than 2\epsilon by Chernoff bounds.  In the worst case, each faulty gate and each faulty input wire causes its own separate faulty output, which gives us  2\theta + 2\epsilon faulty wires.

Next notice that by the properties of the sampler, if all the Maj circuits are correct, we would have a factor of 4 decrease in the fraction of faulty wires leaving these, so that the fraction of faulty output wires would be only 1/2(\theta + \epsilon).  But again by Chernoff bounds, whp, at most 2\epsilon of the Maj circuits will be faulty.  Thus we have at most a 1/2(\theta + \epsilon) + 2\epsilon fraction of faulty wires coming out and this last term is no more than \theta for \theta large enough but depending only on \epsilon.

The final thing to note is that the fraction of faulty wires on the final output cable is the same as the fraction of faulty wires on the two input cables.  Thus, the fraction of faulty wires does not increase with the depth of the circuit!

This is a remarkable result, particularly since it dates back to 1956!  I’d argue that the Von Neumann fault model continues to have the same sway over CS as the Von Neumann architecture.  Our fault models still generally assume that a linear number of processing elements are made faulty, and we seek to tolerate these faults while minimizing resource costs with respect to computation in a world without faults. Unfortunately, a resource blowup of a logarithmic factor seems to be completely intolerable to practitioners – especially if in the common case, there are no faults at all!  This has been a barrier to use of these types of results.
How can we go beyond this model in order to get more practical robustness?  Should we use properties of modern computational networks, i.e. their ability to be reconfigured and to allow for feedback? Should we use a more modern type of competitive analysis: i.e. assume an adversary with a fixed but unknown budget b of processing units that it can corrupt, and then demand that our resource blowup be competitive with b instead of n?  Should we try to get improved results for specialized classes of functions, rather than try to attack the general problem for all types of functions?

Consensus

Consensus[1] is arguably one of the most fundamental problem in distributed computing.  The basic idea of the problem is: n players each vote on one of two choices, and the players  want to hold an election to decide which choice “wins”.  The problem is made more interesting by the fact that there is no leader, i.e. no single player who everyone knows is trustworthy and can be counted on to tabulate all the votes accurately.

Here’s a more precise statement of the problem.  Each of n players starts with either a 0 or a 1 as input.  A certain fraction of the players (say 1/3) are bad, and will collude to thwart the remaining good players.  Unfortunately, the good players have no idea who the bad players are.  Our goal is to create an algorithm for the good players that ensures that

• All good players output the same bit in the end
• The bit output by all good players is the same as the input bit of at least one good player

At first, the second condition may seem ridiculously easy to satisfy and completely useless.  In fact, it’s ridiculously hard to satisfy and extremely useful.  To show the problem is hard, consider a naive algorithm where all players send out their bits to each other and then each player outputs the bit that it received from the majority of other players.  This algorithm fails because the bad guys can send different messages to different people.  In particular, when the vote is close, the bad guys can make one good guy commit to the bit 0 and another good guy commit to the bit 1.

The next thing I want to talk about is how useful this problem is.   First, the problem is broadly useful in the engineering sense of helping us build tools.  If you can solve consensus, you can solve the problem of how to build a system that is more robust than any of its components.  Think about it: each component votes on the outcome of a computation and then with consensus it’s easy to determine which outcome is decided by a plurality of the components (e.g. first everyone sends out their votes, then everyone calculates the majority vote, then everyone solves the consensus problem to commit to a majority vote.)   It’s not surprising then that solutions to the consensus problem have been used in all kinds of systems ranging from flight control, to data bases, to peer-to-peer; Google uses consensus in the Chubby system; Microsoft uses consensus in Farsite; most structured peer-to-peer systems (e.g. Tapestry, Oceanstore) use consensus in some way or another.  Any distributed system built with any pretension towards robustness that is not using consensus probably should be.

But that’s just the engineering side of things.  Consensus is useful because it allows us to study synchronization in complex systems.  How can systems like birds, bees, bacteria, markets come to a decision even when there is no leader.  We know they do it, and that they do it robustly, but exactly how do they do it and what is the trade-off they pay between robustness and the time and communication costs of doing it?  Studying upper and lower bounds on the consensus problem gives insight into these natural systems.  The study of how these agreement building processes, or quorum sensing, occur in nature has become quite popular lately, since they occur so pervasively.

Consensus is also useful because it helps us study fundamental properties of computation.  One of the first major result on consensus due to Fischer, Lynch and Patterson, in 1982, was that consensus is impossible for any deterministic algorithm with even one bad player (in the asynchronous communication model).  However, a follow up paper by Ben-Or showed that with a randomized algorithm, it was possible to solve this problem even with a constant fraction of bad players, albeit in exponential time.  This was a fundamental result giving some idea of how useful randomization can be for computation under an adversarial model.  As a grad student, I remember taking a class with Paul Beame telling us how impressed he was by what these two results said about the power of randomness when they first came out.  Cryptography was also shown to be useful for circumventing the Fischer, Lynch and Patterson result, and I’ve heard of several prominent cryptographers  who were first drawn to that area at the time because of its usefulness in solving consensus.

In the next week or two, I’ll go into some of the details of recent results on this problem that make use of randomness and cryptography.  Early randomized algorithms for consensus like Ben-Or’s used very clever tricks, but no heavy duty mathematical machinery.  More recent results, which run in polynomial time, make use of more modern tricks like the probabilistic method, expanders, extractors, samplers and connections with error-correcting codes, along with assorted cryptographic tricks.  I’ve been involved on the work using randomness, so I’ll probably start there.

[1] The consensus problem is also frequently referred to as the Byzantine agreement problem or simply agreement.  I prefer the name consensus, since it is more succinct and descriptive. While the research community has not yet reached a “consensus” on a single name for this problem, in recent years, the name consensus is being used most frequently.