Rajit Manohar: Metastability

On the (non) existence of bounded time metastability detectors

Rajit Manohar
Cornell NYC Tech, New York, NY 10011
February 21, 2014
Technical note: AVLSI-TN-2014-1

Since the publication of papers on the topic of timing speculation, I have had to review innumerable incorrect papers on related ideas that all exhibit a lack of understanding of the problem of metastability [1]. This is not surprising because the issue is quite subtle, textbooks don't actually explain it very well, and there are a large number of published papers that make the same mistake. People that build on results from other papers don't always spend the effort on carefully verifying them (indeed if they did research would not move forward---we would put the "re" back into research).

There appears to be a widely held belief that it is possible to build a "metastability detector" circuit that can respond in a bounded amount of time. When the question gets asked: "what happens if there is metastability"? it is met with the quick reply: "oh, detect it and take corrective action."

The purpose of this note is to provide a simple, clear explanation as to why this is impossible--as far as we know today. I don't believe this is a new result; but the collection of papers that I went through (e.g. by Marino and others) did not state exactly what I wanted. If someone reading this can point me to a reference, that would be appreciated.

It is clear that bounded time metastability resolution cannot exist if we assume classical physics; whether quantum mechanics provides an out is yet to be seen but many attempts have failed thanks to the uncertainty principle [2].

To make the argument about metastability detectors as simple as possible, I will state it by relying on an impossibility result that most researchers are aware of, and then connecting that to the metastability detector problem.

A one-bit synchronizer is a circuit that has two inputs: a data input, and a clock input. The data input is sampled by the synchronizer using the positive edge of the clock. The output of the synchronizer is the sampled data value. If the clock edge arrives when the input is changing, the output of the synchronizer is permitted to be 0 or 1. Synchronizers are required on asynchronous inputs to synchronous circuits. It is known that we cannot build a failure-free circuit that has this property---there is a non-zero probability that the output will be metastable. In practice these circuits are designed to make their failure probability small enough so that the overall system-level failure rate is acceptable. (The good news is that error probability dies exponentially with the delay budgeted for the synchronizer circuit.) This brings us to our assumption---which is an assumption because we don't have a definitive answer from quantum mechanics (we know it is true in the classical case).

Assumption. It is impossible to build a failure-free bounded time synchronizer.

What is a metastability detector? The simplest definition I have seen is that it is a circuit with one input and one output. The output is a 0 if the input is not metastable, and 1 if the input is metastable.

How does one define a bounded-time metastability detector? This question is actually more subtle than one might expect. The obvious answer is: a metastability detector that produces the correct output in a bounded amount of time. This interpretation is fraught with peril. Consider the following scenarios:

Is an oscillator a bounded time metastability detector? A careful reader will realize that it technically satisfies the requirements stated so far. So obviously it is important to say when the output of the bounded time metastability detector is correct---and once it is correct it doesn't change, unless something happens to its input.
But what does it mean for "something to happen" to its input? To me, the only clear definition of "input change" when the input can be metastable is to include an implicit reference to the clock. Why is that? Because suppose I have a flip-flop whose output is connected to the bounded-time metastability detector. The whole point of bounded time is to say that the output of the metastability detector is synchronous with respect to the flip-flop's clock---or at least safe to sample some fixed number of clock ticks in the future. So there has to be a well-defined "start time" for measuring the bounded delay, and the most natural reference is the flip-flop's clock edge.
What if one defines the start time or "being metastable" by referencing a specific property of the flip-flop output and then state that the flip-flop output is not metastable at the clock edge, but then suddenly becomes metastable half-way through the cycle or at some delayed point that we can't control? The issue is that it is the job of the metastability detector to let us know in a bounded amount of time that some signal becomes metastable. We must be able to reference the delay relative to the sample clock in a way that does not depend on any property of the input data signal. It doesn't make a lot of sense to say "well my metastability detector takes bounded time, but it starts at an unknown time"---if that were acceptable, then we could also say that metastability is a bounded phenomenon---just start counting time from a fixed delay before the metastability resolves itself. (Or to be more picky, it is bounded with probability 1, as the possibility that the metastability will never resolve is a measure zero event.)
Note that saying "unknown w.r.t. to the clock but bounded" is still the same as what I am assuming, because I can just add that as an additional bound on the time taken by the metastability detector, and go back to referencing time with respect to the clock edge.

In the construction that follows, I only use the metastability detector with an input that is the sampled output of a flip-flop, and assume bounded time is with respect to the sampling clock.

Claim. It is impossible to build a failure-free bounded delay metastability detector circuit.
Proof. Let M be a bounded-time metastability detector with inputs M.i and M.o. Consider the following circuit:

One flip-flop F, with data input F.i, clock input F.clk, and output F.o (the clock input is implicit)
One M circuit, input M.i and output M.o
One 2-input or gate, with inputs O.i1, O.o2 and output O.o

Consider the following circuit C, with inputs C.i, C.clk, and C.o:

F.i is connected to C.i
F.o is connected to M.i and O.i1
M.o is connected to O.i2
F.clk is connected to C.clk
O.o is connected to C.o

If M exists, the claim is that C is a failure-free bounded time synchronizer.

Why is that the case? Because if F is not going to be metastable, then M.o tells us in a bounded amount of time. In that case the output of the OR gate is F.o (a copy of F.i at the clock edge), which is valid because we know it is not metastable. If F is metastable, then M.o = 1 which means the output of the OR gate is 1, ignoring F.o. This is also valid, since the synchronizer can report any value when the input is changing. Hence the output of the OR gate satisfies the requirements of a synchronizer.

I have omitted a lot of details to provide the basic intuition. I've also assumed that the metastability detector is fast enough that the OR gate output is produced before the next clock edge.

What if M is really slow and larger than the clock period of the sampling clock? We have a basic issue here which is that the metastability detector has a lower throughput than the data it is supposed to operate on. If we assume that a "wave of input changes" can propagate through M, and they stay spaced out, then we can insert a matched delay on the other input to the OR gate and use a delayed version of the sample clock to sample the output of the OR gate when we know it is safe. If all we know is that M can only operate say once every k clock cycles, we can have k parallel detectors that operate in a phase staggered manner to keep up with the sampled data throughput. All of these are possibilities but the reality is their feasibility would depend on the specific internals of a proposed M.

If we are happy with not requiring full-throughput sampling, the circuit can be easily modified with a clock enable. The clock enable goes high when a sample is needed, and cannot go high again for at least k clock cycles, where k is large enough so that k clock periods are longer than the time required by the metastability detector and OR gate. That is equivalent to requiring that the effective clock frequency in the circuit above is low enough.

In summary, since we know C does not exist and we know that F and O do exist, the only conclusion to be drawn is that M does not exist.

Is that so terrible? Not really, apart from causing some minor embarassment---minor because, as I noted earlier, this is a subtle issue that is rarely explained well. Is this catastrophic for design? Not really, because it is well-known that we can make the failure probability as small as we want by paying for it in latency. Failure rates below 10^-50 can be achieved with practical circuits. It is just that there is a difference between zero failure and a very very small probability of failure. Note that one can reliably identify the exit from metastability [3]. The issue is that we cannot place a bound on how long that will take.

A final comment. I have taken some liberties with terminology such as saying "an input is metastable." To be precise, it really means that the circuit that generated the input is in a metastable state, and the input signal samples part of that state.

I hope this note is helpful to anyone that sees it, and helps to dispell the myth of the bounded time metastability detector circuit. I also hope some of the discussion helps people appreciate the tricky issues involved.

Acknowledgments
I would like to thank Alain J. Martin and Ivan Sutherland for their feedback and suggestions.

References
[1] T.J. Chaney and C. Molnar. Anomalous Behavior of Synchronizer and Arbiter Circuits. IEEE Transactions on Computers, Vol. C-22, No. 4, pp. 421-422, April 1973.
[2] L. Lamport. Buridan's Principle. Foundations of Physics 42, 8 (August 2012) 1056-1066.
[3] C.L. Seitz. "System Timing," Chapter 7 in Mead & Conway, Introduction to VLSI Systems, Addison-Wesley (1980).