dwww Home | Show directory contents | Find package

# Data flow analysis: an informal introduction

## Abstract

This document introduces data flow analysis in an informal way. The goal is to
give the reader an intuitive understanding of how it works, and show how it
applies to a range of refactoring and bug finding problems.

Data flow analysis is a well-established technique; it is described in many
papers, books, and videos. If you would like a more formal, or a more thorough
explanation of the concepts mentioned in this document, please refer to the
following resources:

*   [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)).
*   Videos on the PacketPrep YouTube channel that introduce lattices and the
    necessary background information:
    [#20](https://www.youtube.com/watch?v=73j_FXBXGm8),
    [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8),
    [#22](https://www.youtube.com/watch?v=saOG7Uooeho),
    [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g),
    [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc),
    [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw).
*   [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE)
*   [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf).
*   [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf).
*   [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/).
*   [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf)
    (a paper that successfully applies symbolic execution to real-world
    software).

## Data flow analysis

### The purpose of data flow analysis

Data flow analysis is a static analysis technique that proves facts about a
program or its fragment. It can make conclusions about all paths through the
program, while taking control flow into account and scaling to large programs.
The basic idea is propagating facts about the program through the edges of the
control flow graph (CFG) until a fixpoint is reached.

### Sample problem and an ad-hoc solution

We would like to explain data flow analysis while discussing an example. Let's
imagine that we want to track possible values of an integer variable in our
program. Here is how a human could annotate the code:

```c++
void Example(int n) {
  int x = 0;
  // x is {0}
  if (n > 0) {
    x = 5;
    // x is {5}
  } else {
    x = 42;
    // x is {42}
  }
  // x is {5; 42}
  print(x);
}
```

We use sets of integers to represent possible values of `x`. Local variables
have unambiguous values between statements, so we annotate program points
between statements with sets of possible values.

Here is how we arrived at these annotations. Assigning a constant to `x` allows
us to make a conclusion that `x` can only have one value. When control flow from
the "then" and "else" branches joins, `x` can have either value.

Abstract algebra provides a nice formalism that models this kind of structure,
namely, a lattice. A join-semilattice is a partially ordered set, in which every
two elements have a least upper bound (called a *join*).

```
join(a, b) ⩾ a   and   join(a, b) ⩾ b   and   join(x, x) = x
```

For this problem we will use the lattice of subsets of integers, with set
inclusion relation as ordering and set union as a join.

Lattices are often represented visually as Hasse diagrams. Here is a Hasse
diagram for our lattice that tracks subsets of integers:

![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsInfiniteLattice.svg)

Computing the join in the lattice corresponds to finding the lowest common
ancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount of
literature on efficiently implementing LCA queries for a DAG, however Efficient
Implementation of Lattice Operations (1989)
([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911),
[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that
particularly well-suited for programmatic implementation.

### Too much information and "top" values

Let's try to find the possible sets of values of `x` in a function that modifies
`x` in a loop:

```c++
void ExampleOfInfiniteSets() {
  int x = 0; // x is {0}
  while (condition()) {
    x += 1;  // x is {0; 1; 2; …}
  }
  print(x);  // x is {0; 1; 2; …}
}
```

We have an issue: `x` can have any value greater than zero; that's an infinite
set of values, if the program operated on mathematical integers. In C++ `int` is
limited by `INT_MAX` so technically we have a set `{0; 1; …; INT_MAX}` which is
still really big.

To make our analysis practical to compute, we have to limit the amount of
information that we track. In this case, we can, for example, arbitrarily limit
the size of sets to 3 elements. If at a certain program point `x` has more than
3 possible values, we stop tracking specific values at that program point.
Instead, we denote possible values of `x` with the symbol `⊤` (pronounced "top"
according to a convention in abstract algebra).

```c++
void ExampleOfTopWithALoop() {
  int x = 0;  // x is {0}
  while (condition()) {
    x += 1;   // x is ⊤
  }
  print(x);   // x is ⊤
}
```

The statement "at this program point, `x`'s possible values are `⊤`" is
understood as "at this program point `x` can have any value because we have too
much information, or the information is conflicting".

Note that we can get more than 3 possible values even without a loop:

```c++
void ExampleOfTopWithoutLoops(int n) {
  int x = 0;  // x is {0}
  switch(n) {
    case 0:  x = 1; break; // x is {1}
    case 1:  x = 9; break; // x is {9}
    case 2:  x = 7; break; // x is {7}
    default: x = 3; break; // x is {3}
  }
  // x is ⊤
}
```

### Uninitialized variables and "bottom" values

When `x` is declared but not initialized, it has no possible values. We
represent this fact symbolically as `⊥` (pronounced "bottom").

```c++
void ExampleOfBottom() {
  int x;    // x is ⊥
  x = 42;   // x is {42}
  print(x);
}
```

Note that using values read from uninitialized variables is undefined behaviour
in C++. Generally, compilers and static analysis tools can assume undefined
behavior does not happen. We must model uninitialized variables only when we are
implementing a checker that specifically is trying to find uninitialized reads.
In this example we show how to model uninitialized variables only to demonstrate
the concept of "bottom", and how it applies to possible value analysis. We
describe an analysis that finds uninitialized reads in a section below.

### A practical lattice that tracks sets of concrete values

Taking into account all corner cases covered above, we can put together a
lattice that we can use in practice to track possible values of integer
variables. This lattice represents sets of integers with 1, 2, or 3 elements, as
well as top and bottom. Here is a Hasse diagram for it:

![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsFiniteLattice.svg)

### Formalization

Let's consider a slightly more complex example, and think about how we can
compute the sets of possible values algorithmically.

```c++
void Example(int n) {
  int x;          // x is ⊥
  if (n > 0) {
    if (n == 42) {
       x = 44;    // x is {44}
    } else {
       x = 5;     // x is {5}
    }
    print(x);     // x is {44; 5}
  } else {
    x = n;        // x is ⊤
  }
  print(x);       // x is ⊤
}
```

As humans, we understand the control flow from the program text. We used our
understanding of control flow to find program points where two flows join.
Formally, control flow is represented by a CFG (control flow graph):

![CFG for the code above](DataFlowAnalysisIntroImages/CFGExample.svg)

We can compute sets of possible values by propagating them through the CFG of
the function:

*   When `x` is declared but not initialized, its possible values are `{}`. The
    empty set plays the role of `⊥` in this lattice.

*   When `x` is assigned a concrete value, its possible set of values contains
    just that specific value.

*   When `x` is assigned some unknown value, it can have any value. We represent
    this fact as `⊤`.

*   When two control flow paths join, we compute the set union of incoming
    values (limiting the number of elements to 3, representig larger sets as
    `⊤`).

The sets of possible values are influenced by:

*   Statements, for example, assignments.

*   Joins in control flow, for example, ones that appear at the end of "if"
    statements.

**Effects of statements** are modeled by what is formally known as a transfer
function. A transfer function takes two arguments: the statement, and the state
of `x` at the previous program point. It produces the state of `x` at the next
program point. For example, the transfer function for assignment ignores the
state at the previous program point:

```c++
// GIVEN: x is {42; 44}
x = 0;
// CONCLUSION: x is {0}
```

The transfer function for `+` performs arithmetic on every set member:

```c++
// GIVEN: x is {42, 44}
x = x + 100;
// CONCLUSION: x is {142, 144}
```

**Effects of control flow** are modeled by joining the knowledge from all
possible previous program points.

```c++
if (...) {
  ...
  // GIVEN: x is {42}
} else {
  ...
  // GIVEN: x is {44}
}
// CONCLUSION: x is {42; 44}
```

```c++
// GIVEN: x is {42}
while (...) {
  ...
  // GIVEN: x is {44}
}
// CONCLUSION: {42; 44}
```

The predicate that we marked "given" is usually called a precondition, and the
conclusion is called a postcondition.

In terms of the CFG, we join the information from all predecessor basic blocks.

![Modeling the effects of a CFG basic block](DataFlowAnalysisIntroImages/CFGJoinRule.svg)

Putting it all together, to model the effects of a basic block we compute:

```
out = transfer(basic_block, join(in_1, in_2, ..., in_n))
```

(Note that there are other ways to write this equation that produce higher
precision analysis results. The trick is to keep exploring the execution paths
separately and delay joining until later. However, we won't discuss those
variations here.)

To make a conclusion about all paths through the program, we repeat this
computation on all basic blocks until we reach a fixpoint. In other words, we
keep propagating information through the CFG until the computed sets of values
stop changing.

If the lattice has a finite height and transfer functions are monotonic the
algorithm is guaranteed to terminate.  Each iteration of the algorithm can
change computed values only to larger values from the lattice. In the worst
case, all computed values become `⊤`, which is not very useful, but at least the
analysis terminates at that point, because it can't change any of the values.

Fixpoint iteration can be optimised by only reprocessing basic blocks which had
one of their inputs changed on the previous iteration. This is typically
implemented using a worklist queue. With this optimisation the time complexity
becomes `O(m * |L|)`, where `m` is the number of basic blocks in the CFG and
`|L|` is the size of lattice used by the analysis.

## Symbolic execution: a very short informal introduction

### Symbolic values

In the previous example where we tried to figure out what values a variable can
have, the analysis had to be seeded with a concrete value. What if there are no
assignments of concrete values in the program? We can still deduce some
interesting information by representing unknown input values symbolically, and
computing results as symbolic expressions:

```c++
void PrintAbs(int x) {
  int result;
  if (x >= 0) {
    result = x;   // result is {x}
  } else {
    result = -x;  // result is {-x}
  }
  print(result);  // result is {x; -x}
}
```

We can't say what specific value gets printed, but we know that it is either `x`
or `-x`.

Dataflow analysis is an istance of abstract interpretation, and does not dictate
how exactly the lattice and transfer functions should be designed, beyond the
necessary conditions for the analysis to converge. Nevertheless, we can use
symbolic execution ideas to guide our design of the lattice and transfer
functions: lattice values can be symbolic expressions, and transfer functions
can construct more complex symbolic expressions from symbolic expressions that
represent arguments. See [this StackOverflow
discussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation)
for a further comparison of abstract interpretation and symbolic execution.

### Flow condition

A human can say about the previous example that the function returns `x` when
`x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically by
tracking a flow condition. A flow condition is a predicate written in terms of
the program state that is true at a specific program point regardless of the
execution path that led to this statement. For example, the flow condition for
the program point right before evaluating `result = x` is `x >= 0`.

If we enhance the lattice to be a set of pairs of values and predicates, the
dataflow analysis computes the following values:

```c++
void PrintAbs(int x) {
  int result;
  if (x >= 0) {
    // Flow condition: x >= 0.
    result = x;   // result is {x if x >= 0}
  } else {
    // Flow condition: x < 0.
    result = -x;  // result is {-x if x < 0}
  }
  print(result);  // result is {x if x >= 0; -x if x < 0}
}
```

Of course, in a program with loops, symbolic expressions for flow conditions can
grow unbounded. A practical static analysis system must control this growth to
keep the symbolic representations manageable and ensure that the data flow
analysis terminates. For example, it can use a constraint solver to prune
impossible flow conditions, and/or it can abstract them, losing precision, after
their symbolic representations grow beyond some threshold. This is similar to
how we had to limit the sizes of computed sets of possible values to 3 elements.

### Symbolic pointers

This approach proves to be particularly useful for modeling pointer values,
since we don't care about specific addresses but just want to give a unique
identifier to a memory location.

```c++
void ExampleOfSymbolicPointers(bool b) {
  int x = 0;     // x is {0}
  int* ptr = &x; // x is {0}      ptr is {&x}
  if (b) {
    *ptr = 42;   // x is {42}     ptr is {&x}
  }
  print(x);      // x is {0; 42}  ptr is {&x}
}
```

## Example: finding output parameters

Let's explore how data flow analysis can help with a problem that is hard to
solve with other tools in Clang.

### Problem description

Output parameters are function parameters of pointer or reference type whose
pointee is completely overwritten by the function, and not read before it is
overwritten. They are common in pre-C++11 code due to the absence of move
semantics. In modern C++ output parameters are non-idiomatic, and return values
are used instead.

Imagine that we would like to refactor output parameters to return values to
modernize old code. The first step is to identify refactoring candidates through
static analysis.

For example, in the following code snippet the pointer `c` is an output
parameter:

```c++
struct Customer {
  int account_id;
  std::string name;
}

void GetCustomer(Customer *c) {
  c->account_id = ...;
  if (...) {
    c->name = ...;
  } else {
    c->name = ...;
  }
}
```

We would like to refactor this code into:

```c++
Customer GetCustomer() {
  Customer c;
  c.account_id = ...;
  if (...) {
    c.name = ...;
  } else {
    c.name = ...;
  }
  return c;
}
```

However, in the function below the parameter `c` is not an output parameter
because its field `name` is not overwritten on every path through the function.

```c++
void GetCustomer(Customer *c) {
  c->account_id = ...;
  if (...) {
    c->name = ...;
  }
}
```

The code also cannot read the value of the parameter before overwriting it:

```c++
void GetCustomer(Customer *c) {
  use(c->account_id);
  c->name = ...;
  c->account_id = ...;
}
```

Functions that escape the pointer also block the refactoring:

```c++
Customer* kGlobalCustomer;

void GetCustomer(Customer *c) {
  c->name = ...;
  c->account_id = ...;
  kGlobalCustomer = c;
}
```

To identify a candidate function for refactoring, we need to do the following:

*   Find a function with a non-const pointer or reference parameter.

*   Find the definition of that function.

*   Prove that the function completely overwrites the pointee on all paths
    before returning.

*   Prove that the function reads the pointee only after overwriting it.

*   Prove that the function does not persist the pointer in a data structure
    that is live after the function returns.

There are also requirements that all usage sites of the candidate function must
satisfy, for example, that function arguments do not alias, that users are not
taking the address of the function, and so on. Let's consider verifying usage
site conditions to be a separate static analysis problem.

### Lattice design

To analyze the function body we can use a lattice which consists of normal
states and failure states. A normal state describes program points where we are
sure that no behaviors that block the refactoring have occurred. Normal states
keep track of all parameter's member fields that are known to be overwritten on
every path from function entry to the corresponding program point. Failure
states accumulate observed violations (unsafe reads and pointer escapes) that
block the refactoring.

In the partial order of the lattice failure states compare greater than normal
states, which guarantees that they "win" when joined with normal states. Order
between failure states is determined by inclusion relation on the set of
accumulated violations (lattice's `⩽` is `⊆` on the set of violations). Order
between normal states is determined by reversed inclusion relation on the set of
overwritten parameter's member fields (lattice's `⩽` is `⊇` on the set of
overwritten fields).

![Lattice for data flow analysis that identifies output parameters](DataFlowAnalysisIntroImages/OutputParameterIdentificationLattice.svg)

To determine whether a statement reads or writes a field we can implement
symbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, pointer
dereference operator and `MemberExpr`s.

### Using data flow results to identify output parameters

Let's take a look at how we use data flow analysis to identify an output
parameter. The refactoring can be safely done when the data flow algorithm
computes a normal state with all of the fields proven to be overwritten in the
exit basic block of the function.

```c++
struct Customer {
  int account_id;
  std::string name;
};

void GetCustomer(Customer* c) {
  // Overwritten: {}
  c->account_id = ...; // Overwritten: {c->account_id}
  if (...) {
    c->name = ...;     // Overwritten: {c->account_id, c->name}
  } else {
    c->name = ...;     // Overwritten: {c->account_id, c->name}
  }
  // Overwritten: {c->account_id, c->name}
}
```

When the data flow algorithm computes a normal state, but not all fields are
proven to be overwritten we can't perform the refactoring.

```c++
void target(bool b, Customer* c) {
  // Overwritten: {}
  if (b) {
    c->account_id = 42;     // Overwritten: {c->account_id}
  } else {
    c->name = "Konrad";  // Overwritten: {c->name}
  }
  // Overwritten: {}
}
```

Similarly, when the data flow algorithm computes a failure state, we also can't
perform the refactoring.

```c++
Customer* kGlobalCustomer;

void GetCustomer(Customer* c) {
  // Overwritten: {}
  c->account_id = ...;    // Overwritten: {c->account_id}
  if (...) {
    print(c->name);       // Unsafe read
  } else {
    kGlobalCustomer = c;  // Pointer escape
  }
  // Unsafe read, Pointer escape
}
```

## Example: finding dead stores

Let's say we want to find redundant stores, because they indicate potential
bugs.

```c++
x = GetX();
x = GetY();
```

The first store to `x` is never read, probably there is a bug.

The implementation of dead store analysis is very similar to output parameter
analysis: we need to track stores and loads, and find stores that were never
read.

[Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a
generalization of this idea, which is often used to answer many related
questions, for example:

* finding dead stores,
* finding uninitialized variables,
* finding a good point to deallocate memory,
* finding out if it would be safe to move an object.

## Example: definitive initialization

Definitive initialization proves that variables are known to be initialized when
read. If we find a variable which is read when not initialized then we generate
a warning.

```c++
void Init() {
  int x;    // x is uninitialized
  if (cond()) {
    x = 10; // x is initialized
  } else {
    x = 20; // x is initialized
  }
  print(x); // x is initialized
}
```

```c++
void Uninit() {
  int x;    // x is uninitialized
  if (cond()) {
    x = 10; // x is initialized
  }
  print(x); // x is maybe uninitialized, x is being read, report a bug.
}
```

For this purpose we can use lattice in a form of a mapping from variable
declarations to initialization states; each initialization state is represented
by the followingn lattice:

![Lattice for definitive initialization analysis](DataFlowAnalysisIntroImages/DefinitiveInitializationLattice.svg)

A lattice element could also capture the source locations of the branches that
lead us to the corresponding program point. Diagnostics would use this
information to show a sample buggy code path to the user.

## Example: refactoring raw pointers to `unique_ptr`

Modern idiomatic C++ uses smart pointers to express memory ownership, however in
pre-C++11 code one can often find raw pointers that own heap memory blocks.

Imagine that we would like to refactor raw pointers that own memory to
`unique_ptr`. There are multiple ways to design a data flow analysis for this
problem; let's look at one way to do it.

For example, we would like to refactor the following code that uses raw
pointers:

```c++
void UniqueOwnership1() {
  int *pi = new int;
  if (...) {
    Borrow(pi);
    delete pi;
  } else {
    TakeOwnership(pi);
  }
}
```

into code that uses `unique_ptr`:

```c++
void UniqueOwnership1() {
  auto pi = std::make_unique<int>();
  if (...) {
    Borrow(pi.get());
  } else {
    TakeOwnership(pi.release());
  }
}
```

This problem can be solved with a lattice in form of map from value declarations
to pointer states:

![Lattice that identifies candidates for unique_ptr refactoring](DataFlowAnalysisIntroImages/UniquePtrLattice.svg)

We can perform the refactoring if at the exit of a function `pi` is
`Compatible`.

```c++
void UniqueOwnership1() {
  int *pi;             // pi is Compatible
  pi = new int;        // pi is Defined
  if (...) {
    Borrow(pi);        // pi is Defined
    delete pi;         // pi is Compatible
  } else {
    TakeOwnership(pi); // pi is Compatible
  }
  // pi is Compatible
}
```

Let's look at an example where the raw pointer owns two different memory blocks:

```c++
void UniqueOwnership2() {
  int *pi = new int;  // pi is Defined
  Borrow(pi);
  delete pi;          // pi is Compatible
  if (smth) {
    pi = new int;     // pi is Defined
    Borrow(pi);
    delete pi;        // pi is Compatible
  }
  // pi is Compatible
}
```

It can be refactored to use `unique_ptr` like this:

```c++
void UniqueOwnership2() {
  auto pi = make_unique<int>();
  Borrow(pi);
  if (smth) {
    pi = make_unique<int>();
    Borrow(pi);
  }
}
```

In the following example, the raw pointer is used to access the heap object
after the ownership has been transferred.

```c++
void UniqueOwnership3() {
  int *pi = new int; // pi is Defined
  if (...) {
    Borrow(pi);
    delete pi;       // pi is Compatible
  } else {
    vector<unique_ptr<int>> v = {std::unique_ptr(pi)}; // pi is Compatible
    print(*pi);
    use(v);
  }
  // pi is Compatible
}
```

We can refactor this code to use `unique_ptr`, however we would have to
introduce a non-owning pointer variable, since we can't use the moved-from
`unique_ptr` to access the object:

```c++
void UniqueOwnership3() {
  std::unique_ptr<int> pi = std::make_unique<int>();
  if (...) {
    Borrow(pi);
  } else {
    int *pi_non_owning = pi.get();
    vector<unique_ptr<int>> v = {std::move(pi)};
    print(*pi_non_owning);
    use(v);
  }
}
```

If the original code didn't call `delete` at the very end of the function, then
our refactoring may change the point at which we run the destructor and release
memory. Specifically, if there is some user code after `delete`, then extending
the lifetime of the object until the end of the function may hold locks for
longer than necessary, introduce memory overhead etc.

One solution is to always replace `delete` with a call to `reset()`, and then
perform another analysis that removes unnecessary `reset()` calls.

```c++
void AddedMemoryOverhead() {
  HugeObject *ho = new HugeObject();
  use(ho);
  delete ho; // Release the large amount of memory quickly.
  LongRunningFunction();
}
```

This analysis will refuse to refactor code that mixes borrowed pointer values
and unique ownership. In the following code, `GetPtr()` returns a borrowed
pointer, which is assigned to `pi`. Then, `pi` is used to hold a uniquely-owned
pointer. We don't distinguish between these two assignments, and we want each
assignment to be paired with a corresponding sink; otherwise, we transition the
pointer to a `Conflicting` state, like in this example.

```c++
void ConflictingOwnership() {
  int *pi;           // pi is Compatible
  pi = GetPtr();     // pi is Defined
  Borrow(pi);        // pi is Defined

  pi = new int;      // pi is Conflicting
  Borrow(pi);
  delete pi;
  // pi is Conflicting
}
```

We could still handle this case by finding a maximal range in the code where
`pi` could be in the Compatible state, and only refactoring that part.

```c++
void ConflictingOwnership() {
  int *pi;
  pi = GetPtr();
  Borrow(pi);

  std::unique_ptr<int> pi_unique = std::make_unique<int>();
  Borrow(pi_unique.get());
}
```

## Example: finding redundant branch conditions

In the code below `b1` should not be checked in both the outer and inner "if"
statements. It is likely there is a bug in this code.

```c++
int F(bool b1, bool b2) {
  if (b1) {
    f();
    if (b1 && b2) {  // Check `b1` again -- unnecessary!
      g();
    }
  }
}
```

A checker that finds this pattern syntactically is already implemented in
ClangTidy using AST matchers (`bugprone-redundant-branch-condition`).

To implement it using the data flow analysis framework, we can produce a warning
if any part of the branch condition is implied by the flow condition.

```c++
int F(bool b1, bool b2) {
  // Flow condition: true.
  if (b1) {
    // Flow condition: b1.
    f();
    if (b1 && b2) { // `b1` is implied by the flow condition.
      g();
    }
  }
}
```

One way to check this implication is to use a SAT solver. Without a SAT solver,
we could keep the flow condition in the CNF form and then it would be easy to
check the implication.

## Example: finding unchecked `std::optional` unwraps

Calling `optional::value()` is only valid if `optional::has_value()` is true. We
want to show that when `x.value()` is executed, the flow condition implies
`x.has_value()`.

In the example below `x.value()` is accessed safely because it is guarded by the
`x.has_value()` check.

```c++
void Example(std::optional<int> &x) {
  if (x.has_value()) {
    use(x.value());
  }
}
```

While entering the if branch we deduce that `x.has_value()` is implied by the
flow condition.

```c++
void Example(std::optional<int> x) {
  // Flow condition: true.
  if (x.has_value()) {
    // Flow condition: x.has_value() == true.
    use(x.value());
  }
  // Flow condition: true.
}
```

We also need to prove that `x` is not modified between check and value access.
The modification of `x` may be very subtle:

```c++
void F(std::optional<int> &x);

void Example(std::optional<int> &x) {
  if (x.has_value()) {
    // Flow condition: x.has_value() == true.
    unknown_function(x); // may change x.
    // Flow condition: true.
    use(x.value());
  }
}
```

## Example: finding dead code behind A/B experiment flags

Finding dead code is a classic application of data flow analysis.

Unused flags for A/B experiment hide dead code. However, this flavor of dead
code is invisible to the compiler because the flag can be turned on at any
moment.

We could make a tool that deletes experiment flags. The user tells us which flag
they want to delete, and we assume that the it's value is a given constant.

For example, the user could use the tool to remove `example_flag` from this
code:

```c++
DEFINE_FLAG(std::string, example_flag, "", "A sample flag.");

void Example() {
  bool x = GetFlag(FLAGS_example_flag).empty();
  f();
  if (x) {
    g();
  } else {
    h();
  }
}
```

The tool would simplify the code to:

```c++
void Example() {
  f();
  g();
}
```

We can solve this problem with a classic constant propagation lattice combined
with symbolic evaluation.

## Example: finding inefficient usages of associative containers

Real-world code often accidentally performs repeated lookups in associative
containers:

```c++
map<int, Employee> xs;
xs[42]->name = "...";
xs[42]->title = "...";
```

To find the above inefficiency we can use the available expressions analysis to
understand that `m[42]` is evaluated twice.

```c++
map<int, Employee> xs;
Employee &e = xs[42];
e->name = "...";
e->title = "...";
```

We can also track the `m.contains()` check in the flow condition to find
redundant checks, like in the example below.

```c++
std::map<int, Employee> xs;
if (!xs.contains(42)) {
  xs.insert({42, someEmployee});
}
```

## Example: refactoring types that implicitly convert to each other

Refactoring one strong type to another is difficult, but the compiler can help:
once you refactor one reference to the type, the compiler will flag other places
where this information flows with type mismatch errors. Unfortunately this
strategy does not work when you are refactoring types that implicitly convert to
each other, for example, replacing `int32_t` with `int64_t`.

Imagine that we want to change user IDs from 32 to 64-bit integers. In other
words, we need to find all integers tainted with user IDs. We can use data flow
analysis to implement taint analysis.

```c++
void UseUser(int32_t user_id) {
  int32_t id = user_id;
  // Variable `id` is tainted with a user ID.
  ...
}
```

Taint analysis is very well suited to this problem because the program rarely
branches on user IDs, and almost certainly does not perform any computation
(like arithmetic).

Generated by dwww version 1.15 on Sat May 18 09:05:50 CEST 2024.