Sam Stuewe (halosghost)

Everywhere the Light Touches (And Doesn't)

Slides - No Audio Recording Available - No Video Recording Available

UB creates a large number of headaches for any developer working with C. Below is an edited and reformatted version of a talk I gave on 27 June 2019 (Gregorian) that addresses where UB comes from, how to think about it, and, maybe, how we can live with it.

Behaviors

The C Standard specifies and defines these four types of non-portable behavior:[1]

Unspecified:
use of an unspecified value, or other behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance
Implementation-Defined:
unspecified behavior where each implementation documents how the choice is made
Undefined:
behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements
Locale-Specific:
behavior that depends on local conventions of nationality, culture, and language that each implementation documents
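
To make the four categories concrete, here is a minimal sketch with one small function per category. The function names (and the trace helper) are hypothetical and exist only for illustration; nothing here comes from the Standard beyond the rules noted in the comments.

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: announces that it ran, then returns its argument. */
static int trace(int value) {
    printf("evaluated %d\n", value);
    return value;
}

/* Unspecified: the operands of + may be evaluated in either order, so the
 * two "evaluated" lines may appear in either order. The sum is 3 either way. */
int unspecified_order(void) {
    return trace(1) + trace(2);
}

/* Implementation-defined: the result of right-shifting a negative value is
 * chosen by the implementation, which must document that choice. */
int implementation_defined_shift(void) {
    return -8 >> 1;
}

/* Undefined: evaluating INT_MAX + 1 would be signed overflow, so this
 * sketch checks for that case before doing the addition. */
int increment_or_saturate(int x) {
    return x == INT_MAX ? INT_MAX : x + 1;
}

/* Locale-specific: whether character code 0xE9 counts as a letter depends
 * on the locale currently in effect. */
int locale_specific_letter(void) {
    return isalpha(0xE9) != 0;
}
```

Only the third function needs a guard to stay well-defined; the other three are valid C whose results may simply differ between implementations or locales.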

As a general goal, it is admirable to attempt to avoid all forms of non-portable behavior (making programs that can be run anywhere is one of the main benefits of using C). However, even just using printf can result in a variety of unspecified and undefined behaviors. For the purposes of this article, I will only be discussing undefined behavior (because it holds the most pervasive and surprising effects).

To the surprise of many new C programmers, there are quite a few undefined behaviors. In fact, as of the 2018 revision of C, there are 211 undefined behaviors.[2] Thankfully, the C standard actually itemizes all non-portable behaviors in Annex J. In particular, all undefined behaviors are listed in section 2.

Unfortunately, that is not The Bad News™. In particular, The Bad News™ starts with the first item in the list of undefined behaviors:

A “shall” or “shall not” requirement that appears outside of a constraint is violated (clause 4).

Put more plainly, to make a complete list of the undefined behaviors in C, you would need to read the entire C standard. At over 400 pages (and costing nearly US$200), that is not an option that excites most programmers. Though there is a partial solution to the cost,[3] there is no current solution to the time and patience required. Having and knowing the exhaustive list of undefined behaviors is generally not practical. There are tools to help us, though, and building a better understanding is one of the most powerful.

A Mental Model

There are two phrases in the definition of UB that should be interrogated closely to come to a more complete understanding: “upon use”, and “no requirements”.

Upon reading the Standard's definition, programmers may take “upon use” to mean that, once your program's execution reaches a line of code that “invokes” undefined behavior, the rest of the program's behavior is unpredictable. In fact, the effect of undefined behavior is far more pervasive. For lack of a better phrase, undefined behavior “takes effect” at compile-time. That is, if your program's source code includes a line of code anywhere in it that contains undefined behavior, your program is not considered to be valid C, and the compilation of your program and any executions thereof are not guaranteed to have any meaning.[4]

When a compiler (i.e., an “implementation”) attempts to compile a bit of code with undefined behavior, it might do any of a few things:

But, “no requirements” means that a compiler doesn't actually need to do any of the options above. Technically, a compiler could also generate instructions that always return the integer 42; it could attempt to format your hard drive; it could also do anything else at all, including launching the missiles. Doing so would be rude, and probably not desired by programmers, but still compliant as far as the C Standard is concerned.
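
Signed integer overflow is a small illustration of undefined behavior taking effect at compile-time. The sketch below uses hypothetical function names and assumes a typical optimizing compiler. Because overflow is undefined, an optimizer is permitted to reason that x + 1 can never be less than x and to compile the naive check down to “always false”, so the test a programmer wrote to detect overflow may never fire. The second function asks the same question without ever overflowing.

```c
#include <limits.h>

/* Naive check: only meaningful if x + 1 wraps around, but signed overflow
 * is undefined, so a compiler may fold this whole function to "return 0". */
int next_overflows_naive(int x) {
    return x + 1 < x;
}

/* Well-defined check: compares against the limit and never performs an
 * addition that could overflow. */
int next_overflows(int x) {
    return x == INT_MAX;
}
```

Calling next_overflows_naive(INT_MAX) is itself undefined, so only the second version is safe to call with arbitrary input.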

A Real-World Example

Below is the second winning submission to John Regehr's Undefined Behavior Consequences Contest:


#include <stdio.h>
#include <stdlib.h>

int main() {
  int *p = (int*)malloc(sizeof(int));
  int *q = (int*)realloc(p, sizeof(int));
  *p = 1;
  *q = 2;
  if (p == q)
    printf("%d %d\n", *p, *q);
}
            

Despite such a short code snippet, it might not be immediately clear what goes wrong in this program. Take a moment to ponder what this code should do, what is wrong with it specifically, and what you imagine it actually will do.

Many programmers will suggest that this code has two possible, reasonable behaviors. Either,

  1. p and q point to different locations following the call to realloc(), and so nothing will be printed; or
  2. p and q point to the same location (easily possible), and so the comparison succeeds and 2 2 will be printed (as 2 was the last value written to that location).

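Given the topic of this article, it will not surprise you that neither of these two behaviors is guaranteed. In fact, when compiled with clang -O, the program above prints 1 2. The specific problem with this code is that realloc() is allowed to deallocate the object its argument points to. Once that has happened, the old pointer p is invalid even if q happens to hold the same address, so using p after passing it to realloc() is a so-called “use-after-free” bug, a particularly well-known instance of undefined behavior.[5] An optimizer may therefore assume that the stores through p and q cannot touch the same object, and replace *p and *q in the call to printf() with the constants 1 and 2.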
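
For contrast, here is a minimal sketch of the well-defined way to write the same pattern: after the call, only the pointer returned by realloc() is used. The helper name resize_and_store is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: resizes a one-int buffer and stores value in it.
 * Returns the live pointer, or NULL if the reallocation failed. */
int *resize_and_store(int *p, int value) {
    int *q = realloc(p, sizeof *q);
    if (q == NULL) {
        /* On failure realloc leaves the original block untouched, so p is
         * still valid here and still ours to free. */
        free(p);
        return NULL;
    }
    /* From here on p must be treated as dead, even if q happens to hold
     * the same address. Only q is used. */
    *q = value;
    return q;
}
```

The caller takes the returned pointer as the only handle to the buffer and eventually passes it to free().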

From Whence it Came

A cursory search online will yield a rash of criticism for C's undefined behavior. I propose the following thought experiment: What would you suggest be done in the following cases for a language that does not have exception handling?

C was born in a different era. x86 had barely arrived on the scene (and it would be decades before x64 arrived), let alone asserted its dominance in the market; most programmers still programmed in assembly; two's complement signed integer arithmetic existed, but was hardly universally accepted. With this in mind, it is easier to see why many of these low-level considerations are difficult to specify. The C Committee specify things as undefined behavior for a variety of reasons: to enable simpler implementation or optimization, because there are competing standards already implemented by hardware, or because there is no reasonable alternative (e.g., in the case of integer division by 0).
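
Integer division shows what “no reasonable alternative” looks like in practice. Without exceptions, the language has no result it could sensibly hand back, so the checking falls to the caller. Below is a minimal sketch of a checked division; the name checked_div and its out-parameter convention are assumptions made for illustration, not anything the Standard provides.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: divides only when the result is well-defined.
 * Returns true and writes *quotient on success; returns false (leaving
 * *quotient untouched) for the two undefined cases. */
bool checked_div(int numerator, int denominator, int *quotient) {
    if (denominator == 0) {
        return false; /* division by zero is undefined */
    }
    if (numerator == INT_MIN && denominator == -1) {
        return false; /* the result would not fit in an int */
    }
    *quotient = numerator / denominator;
    return true;
}
```

The second guard matters on two's complement machines, where the magnitude of INT_MIN is one larger than INT_MAX.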

Furthermore, undefined behavior was not wholly created by C and the C Committee (or Dennis Ritchie). In fact, some hardware platforms even have undefined behavior (e.g., situations for which the processor might just halt because of invalid state).[6] In the case of a program creating one of these circumstances, even exception mechanisms will not save your code.

How Do We Cope?

So, finally, The Bad News™: There is no reliable way to know if any codebase, even a mostly trivial one, contains undefined behavior. The bright side is that there are many tools to help us make our code more reliable. Compilers offer a lot of tooling to help: warning flags to let you ask the compiler to tell you if some fishy behaviors might be happening, feature flags to make C behave in more predictable ways (most notoriously in the cases of -fwrapv and -fno-strict-aliasing), and sanitizers (which add runtime code checks to ensure well-defined behavior).[7] You can also leverage static analyzers (e.g., splint and scan-build) to help detect a larger set of possible errors than compilers can detect (in exchange for having to deal with some false positives).[8] You can even leverage some formal methods tooling to help prove your code correct (my personal favorite being frama-c).[9]
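
As a rough sketch of how the compiler tooling fits together, the example below pairs an unchecked shift with a checked one. The file name demo.c and both function names are hypothetical. The flags in the comment exist in both GCC and Clang, though exactly what each one reports varies by compiler and version.

```c
#include <limits.h>

/* Build sketches (demo.c is a hypothetical file name):
 *   cc -Wall -Wextra -O2 demo.c         compile-time warnings
 *   cc -fsanitize=undefined -g demo.c   runtime checks for undefined behavior
 */

/* Unchecked: undefined when n is at least the width of unsigned int.
 * A sanitizer build is meant to report such a call when it happens. */
unsigned bit_unchecked(unsigned n) {
    return 1u << n;
}

/* Checked: returns 0 instead of shifting by too many bits. Assumes that
 * unsigned int has no padding bits, so its width is sizeof * CHAR_BIT. */
unsigned bit_checked(unsigned n) {
    const unsigned width = (unsigned)(sizeof(unsigned) * CHAR_BIT);
    return n < width ? 1u << n : 0u;
}
```

The sanitizer only sees the calls that actually execute, which is why it complements, rather than replaces, warnings and static analysis.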

If a safer language will work well for your use-case and your users, carefully weigh the benefits of C against the negatives. For those of us that do not yet have such an option, we must strive to leverage the tools available, and to read the manuals and standards as much as possible.

Further Reading / Prior Art

Along with all the references in the footnotes, I would recommend reading the two following articles as well; both go far more in-depth than this crash-course and are lovely references:

  1. §3.4. ISO/IEC 9899:2018
  2. Annex J, Section 2. ISO/IEC 9899:2018
  3. Thankfully, the final drafts of each standard revision (which tend to be very close to the published standard) are made freely available. The final draft of what has become C18 has been archived.
  4. Regehr, John. “A Guide to Undefined Behavior in C and C++”. https://blog.regehr.org/archives/213.
  5. §7.22.3.3, paragraph 2. ISO/IEC 9899:2018
  6. For example, the unofficially named HCF instructions. See Halt and Catch Fire for more.
  7. The GNU Compiler Collection documentation, Section 3.16
  8. The Wikipedia page on tools for static analysis includes a section for C and C++.
  9. A truly excellent tutorial to get started with frama-c was written by Allan Blanchard.