Sheaves and Probability, Part 3
The distributed algorithm for inference I described last time is kind of awkward, and not very well connected with the rest of the field. Olivier Peltre has a better one, and it's closely related to the well-known message-passing algorithms for this sort of calculation. Unfortunately, while I think I have a decent grasp of the general idea, I don't have a very detailed understanding of all the moving pieces, so some of this exposition might be just a little bit wrong.
We will need to give up a bit of our nice simplicial structure to make everything work properly. We will instead work directly with a cover of our set of random variables by a collection of sets $\mathcal{C}$, where we require that $\mathcal{C}$ be closed under intersection. This requirement makes $\mathcal{C}$ a partially ordered set under the relation $a \leq b$ if $b \subseteq a$. (The change in direction is to ensure that our sheaves stay sheaves and our cosheaves stay cosheaves.) The state spaces $E_a$ for each $a \in \mathcal{C}$ still form a sheaf $\mathcal{E}$ (i.e., a covariant functor) over $\mathcal{C}$; and so we also get the cosheaf $\mathcal{A}$ of observables and the sheaf $\mathcal{F}$ of functionals. (This is probably terrible terminology for anyone not comfortable with cellular sheaves, since typically we want to think of a presheaf as a contravariant functor. Sorry.)
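To have something concrete to play with, here's a quick Python sketch of this setup. The representation is my own choice, not Peltre's: regions are frozensets of variable indices, closing a cover under intersection is a fixed-point computation, and the (reversed) order relation is just superset inclusion.

```python
from itertools import combinations

def close_under_intersection(cover):
    """Close a collection of regions (frozensets) under pairwise intersection."""
    regions = set(cover)
    while True:
        new = {a & b for a, b in combinations(regions, 2)} - regions
        new.discard(frozenset())  # dropping the empty region is a modeling choice
        if not new:
            return regions
        regions |= new

# Example: two "edge" regions sharing a variable.
cover = close_under_intersection({frozenset({0, 1}), frozenset({1, 2})})
leq = lambda a, b: b <= a  # a <= b in our reversed order iff b ⊆ a
print(sorted(map(sorted, cover)))  # [[0, 1], [1], [1, 2]]
```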
We can move back to the simplicial world by taking the nerve of $\mathcal{C}$: this is the simplicial complex $N(\mathcal{C})$ whose $k$-simplices are nondegenerate chains $a_0 < a_1 < \cdots < a_k$ of length $k$ in $\mathcal{C}$. In particular, vertices of $N(\mathcal{C})$ correspond to elements $a \in \mathcal{C}$, while edges correspond to pairs $a < b$. A sheaf or cosheaf on $\mathcal{C}$ extends naturally to $N(\mathcal{C})$. The stalk over a simplex $a_0 < \cdots < a_k$ is the stalk over its top element $a_k$, and the restriction maps between 0- and 1-simplices are, for a sheaf $\mathcal{S}$, the map $\mathcal{S}_a \to \mathcal{S}_{a<b} = \mathcal{S}_b$ induced by $a \leq b$ and the identity $\mathcal{S}_b \to \mathcal{S}_{a<b}$; and, for a cosheaf $\mathcal{T}$, the map $\mathcal{T}_{a<b} = \mathcal{T}_b \to \mathcal{T}_a$ and the identity $\mathcal{T}_{a<b} \to \mathcal{T}_b$.
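Enumerating the low-dimensional simplices of the nerve is easy in the same toy representation (chains here are strict: each region properly contains the next):

```python
def nerve(cover, max_dim=2):
    """k-simplices of the nerve: strict chains a_0 < ... < a_k in the poset,
    i.e. sequences of regions each properly containing the next."""
    simplices = {0: [(a,) for a in cover]}
    for k in range(1, max_dim + 1):
        simplices[k] = [chain + (b,) for chain in simplices[k - 1]
                        for b in cover if b < chain[-1]]  # b a proper subset
    return simplices

cover = {frozenset({0, 1}), frozenset({1, 2}), frozenset({1})}
X = nerve(cover)
print(len(X[0]), len(X[1]), len(X[2]))  # 3 vertices, 2 edges, 0 triangles
```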
Extending a sheaf or cosheaf from $\mathcal{C}$ to $N(\mathcal{C})$ preserves homology and cohomology (defined in terms of derived functors and all that), so this is not an unreasonable thing to do. The interpretations of $H_0(\mathcal{A})$ and $H^0(\mathcal{F})$ are also the same: homology classes of $\mathcal{A}$ correspond to Hamiltonians defined as a sum of local terms, and cohomology classes of $\mathcal{F}$ (when normalized) correspond to consistent local marginals (or pseudomarginals).
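Concretely, a 0-cochain of $\mathcal{F}$ is a family of local distributions, and the coboundary on an edge $a < b$ just compares the marginal of $p_a$ against $p_b$, so vanishing coboundary is exactly consistency. A sketch (sign convention mine):

```python
import numpy as np

def marginalize(p, a, b):
    """Push a table on the states of region a down to region b ⊆ a.
    Tables are arrays of shape (2,)*len(a) with axes ordered by sorted(a)."""
    axes = tuple(i for i, v in enumerate(sorted(a)) if v not in b)
    return p.sum(axis=axes)

def coboundary(p, edges):
    """Sheaf coboundary of a 0-cochain: on each edge (a, b) with b ⊂ a,
    the marginal of p[a] minus p[b]."""
    return {(a, b): marginalize(p[a], a, b) - p[b] for a, b in edges}

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
rng = np.random.default_rng(0)
pA = rng.random((2, 2)); pA /= pA.sum()
p = {A: pA, C: marginalize(pA, A, C)}
p[B] = p[C][:, None] * np.array([0.3, 0.7])  # built to have the right x1-marginal
consistent = all(np.allclose(v, 0) for v in coboundary(p, [(A, C), (B, C)]).values())
print(consistent)  # True: this family is a section (consistent pseudomarginals)
```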
We now come to the part of the framework I have always found most confusing, because it's seldom very well explained. Peltre (and others, probably) makes a distinction between the interaction potentials $(\phi_a)_{a \in \mathcal{C}}$, which are summed to get a Hamiltonian $H = \sum_a \phi_a$ on the total state space $E$, and the collection of local Hamiltonians $(H_a)_{a \in \mathcal{C}}$, obtained by treating each subset $a$ as if it were an isolated system and summing the potentials for $b \subseteq a$. Both of these can be seen as 0-chains of $\mathcal{A}$, and they are not really homologically related. There is an invertible linear map $\zeta$, the zeta transform, which sends a collection of interaction potentials to its family of local Hamiltonians. The formula is straightforward: $\zeta(\phi)_a = \sum_{b \subseteq a} \phi_b$, with each $\phi_b$ implicitly extended to $E_a$. The inverse of $\zeta$ is the Möbius transform $\mu$, given by $\mu(H)_a = \sum_{b \subseteq a} \mu_{ab} H_b$. The coefficients $\mu_{ab}$ are the Möbius numbers of the poset $\mathcal{C}$. Notice that these formulas refer to the poset relations, not to the incidence of simplices in $N(\mathcal{C})$, which is a bit frustrating: the simplicial structure doesn't seem to really be doing much here. However, $\zeta$ and $\mu$ can be extended to chain maps, and hence descend to homology. There are analogous operations $\zeta^*$ and $\mu^*$ on cochains of $\mathcal{F}$, adjoint to the operations on the chains of $\mathcal{A}$.
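Here are the zeta transform and a brute-force computation of the Möbius numbers, for scalar-valued 0-chains (a real implementation would carry the extension maps along, as in the other sketches; the code works with plain inclusion rather than the blog's reversed order):

```python
def mobius_numbers(cover):
    """Möbius numbers mu[(b, a)] for b ⊆ a in the cover, via the standard
    recursion: mu(a, a) = 1 and the values on each interval [b, a] sum to 0."""
    mu = {}
    for a in sorted(cover, key=len):  # process supersets after their subsets
        for b in cover:
            if b <= a:
                mu[(b, a)] = 1 if b == a else -sum(
                    mu[(b, c)] for c in cover if b <= c and c < a)
    return mu

def zeta(phi, cover):
    """Local Hamiltonians: H_a = sum of phi_b over cover elements b ⊆ a."""
    return {a: sum(phi[b] for b in cover if b <= a) for a in cover}

def mobius(H, cover, mu):
    """Inverse of zeta: phi_a = sum over b ⊆ a of mu(b, a) * H_b."""
    return {a: sum(mu[(b, a)] * H[b] for b in cover if b <= a) for a in cover}

cover = {frozenset({0, 1}), frozenset({1, 2}), frozenset({1})}
mu = mobius_numbers(cover)
phi = {a: float(len(a)) for a in cover}            # some scalar "potentials"
print(mobius(zeta(phi, cover), cover, mu) == phi)  # True: mu inverts zeta
```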
The action of a 0-cochain $p \in C^0(\mathcal{F})$ on a 0-chain $\phi \in C_0(\mathcal{A})$, given by the pairing $\langle \phi, p \rangle = \sum_a \mathbb{E}_{p_a}[\phi_a]$, calculates the expected value of the Hamiltonian given by a class of interaction potentials with respect to a consistent set of marginals. If a chain $H$ represents a collection of local Hamiltonians, then this pairing doesn't compute an expectation. Instead, if $H = \zeta(\phi)$ and $p$ is a section, we have to take $\langle \mu(H), p \rangle$ to get the expectation. By adjointness, this is $\langle H, \mu^*(p) \rangle$.
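A quick numerical sanity check of this, in the toy model: pairing the interaction potentials against the exact marginals of a Gibbs distribution recovers the expected energy, while pairing the local Hamiltonians against them double-counts the shared potentials.

```python
import numpy as np
rng = np.random.default_rng(1)

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
cover = [A, B, C]
phi = {a: rng.normal(size=(2,) * len(a)) for a in cover}  # random potentials

# Global Hamiltonian on (x0, x1, x2) and its Gibbs distribution.
H = phi[A][:, :, None] + phi[B][None, :, :] + phi[C][None, :, None]
p_global = np.exp(-H); p_global /= p_global.sum()

def marginal(p, region):  # marginalize the global table onto a region
    return p.sum(axis=tuple(i for i in range(3) if i not in region))

p = {a: marginal(p_global, a) for a in cover}  # exact, hence consistent, marginals

pair = lambda f, q: sum((f[a] * q[a]).sum() for a in cover)  # <f, q>
E_H = (H * p_global).sum()
print(np.isclose(pair(phi, p), E_H))  # True: <phi, p> is the expected energy

# The local Hamiltonians zeta(phi), written out by hand for this small cover:
Hloc = {A: phi[A] + phi[C][None, :], B: phi[B] + phi[C][:, None], C: phi[C]}
print(np.isclose(pair(Hloc, p), E_H))  # False: phi_C is counted three times
```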
For reasons I don't understand very deeply, these operations play a key role in formulating message passing dynamics. My current understanding is that it has to do with the gradients of the Bethe entropy. The Bethe entropy is defined on a section $p$ of $\mathcal{F}$ by the same inclusion-exclusion procedure as we used before, but using the Möbius coefficients to perform the inclusion-exclusion: $S_B(p) = \sum_a \mu(S(p))_a = \sum_a c_a S(p_a)$, where $S(p)$ is the 0-chain of local entropies $S(p_a) = -\sum_{x \in E_a} p_a(x) \log p_a(x)$ and $c_a = \sum_{b \supseteq a} \mu_{ba}$. Think of this as computing the local (i.e. marginalized) entropies and then applying the Möbius transform.
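For the toy chain cover the Bethe numbers work out to $c = (1, 1, -1)$, and because that cover is tree-like, the Bethe entropy is not just an approximation there: it equals the global entropy exactly. A self-contained check (Bethe numbers hard-coded; they also drop out of the Möbius recursion above):

```python
import numpy as np
rng = np.random.default_rng(2)

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
c = {A: 1.0, B: 1.0, C: -1.0}  # c_a = sum of Möbius numbers mu_{ba} over b ⊇ a

entropy = lambda q: -(q * np.log(q)).sum()

# A distribution that is Markov along the chain 0 - 1 - 2.
p01 = rng.random((2, 2)); p01 /= p01.sum()            # p(x0, x1)
cond = rng.random((2, 2)); cond /= cond.sum(axis=1, keepdims=True)
p_global = p01[:, :, None] * cond[None, :, :]         # p(x0, x1) p(x2 | x1)

p = {A: p01, B: p_global.sum(axis=0), C: p01.sum(axis=0)}
S_Bethe = sum(c[a] * entropy(p[a]) for a in c)
print(np.isclose(S_Bethe, entropy(p_global)))  # True: exact on a tree
```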
The Bethe free energy is the functional we minimize to find the approximate pseudomarginals: $F_B(p) = \langle \phi, p \rangle - S_B(p)$. Adding the constraint that $p$ be consistent ($\delta p = 0$) and normalized ($\langle \mathbf{1}, p_a \rangle = 1$ for every $a$), we get the Lagrangian condition

$$\nabla_p \left[ \langle \phi, p \rangle - S_B(p) + \langle \lambda, \delta p \rangle + \sum_a n_a \big( \langle \mathbf{1}, p_a \rangle - 1 \big) \right] = 0,$$

which we can then convert (using the adjunction $\langle \lambda, \delta p \rangle = \langle \partial \lambda, p \rangle$, and noting that the gradient of $S_B$, evaluated at consistent $p$, agrees with $-\mu(\log p + \mathbf{1})$ up to something in the image of $\partial$, which is absorbed into the multiplier term) to

$$\phi + \mu(\log p + \mathbf{1}) + \partial \lambda + n = 0.$$

The Lagrange multiplier $n$ is an arbitrary 0-chain which is constant on each vertex, so its zeta transform also satisfies the same condition, and we can roll the constant chain $\mu(\mathbf{1})$ into it. $\lambda$ is an arbitrary 1-chain. Ultimately, applying $\zeta$ to both sides, we have the requirement that

$$p = \exp\big(-\zeta(\phi + \partial \lambda + n)\big).$$
This holds on the level of cochains, and of course the cochain $p$ must also be in $\ker \delta$. So in order to be a critical point for the Bethe free energy functional, $p$ must be obtained as the (exponentiated, negated) zeta transform of something homologous to the given local potential $\phi$, up to some normalizing additive constant. Solving the marginalization problem amounts to a search for a local potential $\phi'$ homologous to the given $\phi$ that makes $p = e^{-\zeta(\phi')}$ a section of $\mathcal{F}$. Note that every section of $\mathcal{F}$ is normalized to some constant sum (marginalization preserves total mass, so consistency forces all the $p_a$ to have the same mass), so the normalization constraint isn't all that interesting. Further, if $p = e^{-\zeta(\phi')}$, it is automatically nonnegative.
One of the key insights now is that if we define a differential equation on $C_0(\mathcal{A})$ where the derivative of $\phi$ is always in the image of $\partial : C_1(\mathcal{A}) \to C_0(\mathcal{A})$, the evolution of the state will be restricted to the homology class of the initial condition. If we can define an operator from 0-chains to 1-chains which vanishes exactly when $p = e^{-\zeta(\phi)}$ is a section of $\mathcal{F}$, we'll be able to use it to implement just such a differential equation.
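The invariance is easy to see concretely: shifting $\phi$ by the boundary of a 1-chain moves potential between regions in a way that telescopes away in the global Hamiltonian. A check in the toy model (conventions mine):

```python
import numpy as np
rng = np.random.default_rng(3)

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
phi = {a: rng.normal(size=(2,) * len(a)) for a in (A, B, C)}

def extend(f, b, a):  # pull a table on region b ⊆ a back to region a
    shape = [2 if v in b else 1 for v in sorted(a)]
    return np.broadcast_to(f.reshape(shape), (2,) * len(a))

def glob(f, region):  # extend a local table to the global space (x0, x1, x2)
    return f.reshape([2 if v in region else 1 for v in range(3)])

global_H = lambda f: sum(glob(f[a], a) for a in f)  # broadcasts to shape (2,2,2)

def boundary(lam):
    """Cosheaf boundary of a 1-chain: the edge (a, b), b ⊂ a, contributes
    +lambda_e at vertex b and -extend(lambda_e) at vertex a."""
    d = {a: np.zeros((2,) * len(a)) for a in (A, B, C)}
    for (a, b), le in lam.items():
        d[b] = d[b] + le
        d[a] = d[a] - extend(le, b, a)
    return d

lam = {(A, C): rng.normal(size=2), (B, C): rng.normal(size=2)}  # a 1-chain
d = boundary(lam)
phi2 = {a: phi[a] + d[a] for a in phi}  # homologous potentials
print(np.allclose(global_H(phi), global_H(phi2)))  # True: same global Hamiltonian
```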
Let $V(\phi) = \partial \Phi(\phi)$, where the flux $\Phi : C_0(\mathcal{A}) \to C_1(\mathcal{A})$ is defined on an edge $a < b$ by

$$\Phi(\phi)_{a \to b} = \log\left( \frac{p_b}{\pi_b\, p_a} \right), \qquad p = e^{-\zeta(\phi)},$$

where $\pi_b$ denotes marginalization onto $E_b$. The operator $V$ clearly vanishes when $p$ is a section of $\mathcal{F}$, since $\Phi$ does. In fact, $\Phi$ vanishes at precisely these points. This operator is somewhat mysterious, but if you squint hard enough it looks kind of like a Laplacian. In fact, its linearization around a given 0-chain appears to be a sheaf Laplacian.
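Here's the flux in the running example (sign conventions mine, chosen so that the flow below converges; Peltre's may differ):

```python
import numpy as np
rng = np.random.default_rng(4)

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
cover, edges = [A, B, C], [(A, C), (B, C)]

def extend(f, b, a):
    shape = [2 if v in b else 1 for v in sorted(a)]
    return np.broadcast_to(f.reshape(shape), (2,) * len(a))

def marginalize(p, a, b):
    axes = tuple(i for i, v in enumerate(sorted(a)) if v not in b)
    return p.sum(axis=axes)

def beliefs(phi):  # p = exp(-zeta(phi)), unnormalized
    return {a: np.exp(-sum(extend(phi[b], b, a) for b in cover if b <= a))
            for a in cover}

def flux(phi):  # Phi(phi)_{a->b} = log(p_b / marginal of p_a onto b)
    p = beliefs(phi)
    return {(a, b): np.log(p[b] / marginalize(p[a], a, b)) for a, b in edges}

phi = {a: rng.normal(size=(2,) * len(a)) for a in cover}
for (a, b), v in flux(phi).items():
    print(sorted(a), '->', sorted(b), np.round(v, 3))  # generically nonzero
```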
Thus the system of differential equations $\dot{\phi} = \partial \Phi(\phi)$ has stationary points corresponding precisely to the critical points of the Bethe free energy. If you discretize this ODE with a naive Euler scheme with step size 1, you get the discrete-time evolution equation $\phi_{t+1} = \phi_t + \partial \Phi(\phi_t)$, and this implies dynamics on the local marginal estimates $p = e^{-\zeta(\phi)}$. It turns out that these dynamics on $p$ are precisely the standard message-passing dynamics, typically described in terms of marginalization, sums, and products. This is a really neat result. And despite the complications involved in defining everything, I find this more comprehensible than the other expositions of generalized message-passing algorithms I've read. (I'm looking at you, Wainwright and Jordan.) I'd still like to understand this better, though. What exactly is the relationship between the operator $\partial \Phi$ and a sheaf Laplacian? Is there a sheaf-theoretic way to understand the max-product algorithm for finding maximum-likelihood elements rather than marginal distributions? Can we use (possibly conically constrained) cohomology of $\mathcal{F}$ to say anything about the pseudomarginal problem?
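Putting the pieces together on the toy chain: the Euler step with step size 1 acts like a parallel belief-propagation update, the flux goes to zero, and since this cover is a tree the converged beliefs are the exact marginals. A sketch (repeating the helpers from the flux example so the block runs on its own):

```python
import numpy as np
rng = np.random.default_rng(5)

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
cover, edges = [A, B, C], [(A, C), (B, C)]

def extend(f, b, a):
    shape = [2 if v in b else 1 for v in sorted(a)]
    return np.broadcast_to(f.reshape(shape), (2,) * len(a))

def marginalize(p, a, b):
    axes = tuple(i for i, v in enumerate(sorted(a)) if v not in b)
    return p.sum(axis=axes)

def beliefs(phi):  # p = exp(-zeta(phi)), unnormalized
    return {a: np.exp(-sum(extend(phi[b], b, a) for b in cover if b <= a))
            for a in cover}

def flux(phi):  # Phi(phi)_{a->b} = log(p_b / marginal of p_a onto b)
    p = beliefs(phi)
    return {(a, b): np.log(p[b] / marginalize(p[a], a, b)) for a, b in edges}

def euler_step(phi):  # phi <- phi + boundary(flux(phi)): one parallel sweep
    F, new = flux(phi), dict(phi)
    for (a, b), f in F.items():
        new[b] = new[b] + f
        new[a] = new[a] - extend(f, b, a)
    return new

phi = {a: rng.normal(size=(2,) * len(a)) for a in cover}
for _ in range(5):
    phi = euler_step(phi)

print(max(np.abs(v).max() for v in flux(phi).values()) < 1e-8)  # True: converged

# Beliefs at the fixed point match the exact marginals of the Gibbs distribution
# (whose global Hamiltonian is unchanged: phi only moved within its class).
glob = lambda f, r: f.reshape([2 if v in r else 1 for v in range(3)])
p_global = np.exp(-sum(glob(phi[a], a) for a in cover))
p_global = p_global / p_global.sum()
pC = beliefs(phi)[C]; pC = pC / pC.sum()
print(np.allclose(pC, p_global.sum(axis=(0, 2))))  # True: exact marginal on x1
```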