Sheaves and Probability, Part 3
The distributed algorithm for inference I described last time is kind of awkward, and not very well connected with the rest of the field. Olivier Peltre has a better one, and it's closely related to the well-known message-passing algorithms for this sort of calculation. Unfortunately, while I think I have a decent grasp of the general idea, I don't have a very detailed understanding of all the moving pieces, so some of this exposition might be just a little bit wrong.
We will need to give up a bit of our nice simplicial structure to make everything work properly. We will instead work directly with a cover of our set of random variables by a collection of sets $\mathcal{C}$, where we require that $\mathcal{C}$ be closed under intersection. This requirement makes $\mathcal{C}$ a partially ordered set under the relation $a \leq b$ if $b \subseteq a$. (The change in direction is to ensure that our sheaves stay sheaves and our cosheaves stay cosheaves.) The state spaces $E_a$ for each $a \in \mathcal{C}$ still form a sheaf $\mathcal{E}$ (i.e., a covariant functor) over $\mathcal{C}$; and so we also get the cosheaf $\mathcal{A}$ of observables and the sheaf $\mathcal{F}$ of functionals. (This is probably terrible terminology for anyone not comfortable with cellular sheaves, since typically we want to think of a presheaf as a contravariant functor. Sorry.)
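To have something concrete to play with, here's a quick Python sketch of this setup. The representation is my own choice, not Peltre's: regions are frozensets of variable indices, closing a cover under intersection is a fixed-point computation, and the (reversed) order relation is just superset inclusion.

```python
from itertools import combinations

def close_under_intersection(cover):
    """Close a collection of regions (frozensets) under pairwise intersection."""
    regions = set(cover)
    while True:
        new = {a & b for a, b in combinations(regions, 2)} - regions
        new.discard(frozenset())  # dropping the empty region is a modeling choice
        if not new:
            return regions
        regions |= new

# Example: two "edge" regions sharing a variable.
cover = close_under_intersection({frozenset({0, 1}), frozenset({1, 2})})
leq = lambda a, b: b <= a  # a <= b in our reversed order iff b ⊆ a
print(sorted(map(sorted, cover)))  # [[0, 1], [1], [1, 2]]
```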
We can move back to the simplicial world by taking the nerve of $\mathcal{C}$: this is the simplicial complex $N(\mathcal{C})$ whose $k$-simplices are nondegenerate chains $a_0 < a_1 < \cdots < a_k$ of length $k$ in $\mathcal{C}$. In particular, vertices of $N(\mathcal{C})$ correspond to elements $a \in \mathcal{C}$, while edges correspond to pairs $a < b$. A sheaf or cosheaf on $\mathcal{C}$ extends naturally to $N(\mathcal{C})$. The stalk over a simplex $a_0 < \cdots < a_k$ is the stalk over its top element $a_k$, and the restriction maps between 0- and 1-simplices are, for a sheaf $\mathcal{S}$, the map $\mathcal{S}_a \to \mathcal{S}_{a<b} = \mathcal{S}_b$ induced by $a \leq b$ and the identity $\mathcal{S}_b \to \mathcal{S}_{a<b}$; and, for a cosheaf $\mathcal{T}$, the map $\mathcal{T}_{a<b} = \mathcal{T}_b \to \mathcal{T}_a$ and the identity $\mathcal{T}_{a<b} \to \mathcal{T}_b$.
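Enumerating the low-dimensional simplices of the nerve is easy in the same toy representation (chains here are strict: each region properly contains the next):

```python
def nerve(cover, max_dim=2):
    """k-simplices of the nerve: strict chains a_0 < ... < a_k in the poset,
    i.e. sequences of regions each properly containing the next."""
    simplices = {0: [(a,) for a in cover]}
    for k in range(1, max_dim + 1):
        simplices[k] = [chain + (b,) for chain in simplices[k - 1]
                        for b in cover if b < chain[-1]]  # b a proper subset
    return simplices

cover = {frozenset({0, 1}), frozenset({1, 2}), frozenset({1})}
X = nerve(cover)
print(len(X[0]), len(X[1]), len(X[2]))  # 3 vertices, 2 edges, 0 triangles
```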
Extending a sheaf or cosheaf from $\mathcal{C}$ to $N(\mathcal{C})$ preserves homology and cohomology (defined in terms of derived functors and all that), so this is not an unreasonable thing to do. The interpretations of $H_0(\mathcal{A})$ and $H^0(\mathcal{F})$ are also the same: homology classes of $\mathcal{A}$ correspond to Hamiltonians defined as a sum of local terms, and cohomology classes of $\mathcal{F}$ (when normalized) correspond to consistent local marginals (or pseudomarginals).
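Concretely, a 0-cochain of $\mathcal{F}$ is a family of local distributions, and the coboundary on an edge $a < b$ just compares the marginal of $p_a$ against $p_b$, so vanishing coboundary is exactly consistency. A sketch (sign convention mine):

```python
import numpy as np

def marginalize(p, a, b):
    """Push a table on the states of region a down to region b ⊆ a.
    Tables are arrays of shape (2,)*len(a) with axes ordered by sorted(a)."""
    axes = tuple(i for i, v in enumerate(sorted(a)) if v not in b)
    return p.sum(axis=axes)

def coboundary(p, edges):
    """Sheaf coboundary of a 0-cochain: on each edge (a, b) with b ⊂ a,
    the marginal of p[a] minus p[b]."""
    return {(a, b): marginalize(p[a], a, b) - p[b] for a, b in edges}

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
rng = np.random.default_rng(0)
pA = rng.random((2, 2)); pA /= pA.sum()
p = {A: pA, C: marginalize(pA, A, C)}
p[B] = p[C][:, None] * np.array([0.3, 0.7])  # built to have the right x1-marginal
consistent = all(np.allclose(v, 0) for v in coboundary(p, [(A, C), (B, C)]).values())
print(consistent)  # True: this family is a section (consistent pseudomarginals)
```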
We now come to the part of the framework I have always found most confusing, because it's seldom very well explained. Peltre (and others, probably) makes a distinction between the interaction potentials $(\phi_a)_{a \in \mathcal{C}}$, which are summed to get a Hamiltonian $H = \sum_a \phi_a$ on the total state space $E$, and the collection of local Hamiltonians $(H_a)_{a \in \mathcal{C}}$, obtained by treating each subset $a$ as if it were an isolated system and summing the potentials for $b \subseteq a$. Both of these can be seen as 0-chains of $\mathcal{A}$, and they are not really homologically related. There is an invertible linear map $\zeta$, the zeta transform, which sends a collection of interaction potentials to its family of local Hamiltonians. The formula is straightforward: $\zeta(\phi)_a = \sum_{b \subseteq a} \phi_b$, with each $\phi_b$ implicitly extended to $E_a$. The inverse of $\zeta$ is the Möbius transform $\mu$, given by $\mu(H)_a = \sum_{b \subseteq a} \mu_{ab} H_b$. The coefficients $\mu_{ab}$ are the Möbius numbers of the poset $\mathcal{C}$. Notice that these formulas refer to the poset relations, not to the incidence of simplices in $N(\mathcal{C})$, which is a bit frustrating: the simplicial structure doesn't seem to really be doing much here. However, $\zeta$ and $\mu$ can be extended to chain maps, and hence descend to homology. There are analogous operations $\zeta^*$ and $\mu^*$ on cochains of $\mathcal{F}$, adjoint to the operations on the chains of $\mathcal{A}$.
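Here are the zeta transform and a brute-force computation of the Möbius numbers, for scalar-valued 0-chains (a real implementation would carry the extension maps along, as in the other sketches; the code works with plain inclusion rather than the blog's reversed order):

```python
def mobius_numbers(cover):
    """Möbius numbers mu[(b, a)] for b ⊆ a in the cover, via the standard
    recursion: mu(a, a) = 1 and the values on each interval [b, a] sum to 0."""
    mu = {}
    for a in sorted(cover, key=len):  # process supersets after their subsets
        for b in cover:
            if b <= a:
                mu[(b, a)] = 1 if b == a else -sum(
                    mu[(b, c)] for c in cover if b <= c and c < a)
    return mu

def zeta(phi, cover):
    """Local Hamiltonians: H_a = sum of phi_b over cover elements b ⊆ a."""
    return {a: sum(phi[b] for b in cover if b <= a) for a in cover}

def mobius(H, cover, mu):
    """Inverse of zeta: phi_a = sum over b ⊆ a of mu(b, a) * H_b."""
    return {a: sum(mu[(b, a)] * H[b] for b in cover if b <= a) for a in cover}

cover = {frozenset({0, 1}), frozenset({1, 2}), frozenset({1})}
mu = mobius_numbers(cover)
phi = {a: float(len(a)) for a in cover}            # some scalar "potentials"
print(mobius(zeta(phi, cover), cover, mu) == phi)  # True: mu inverts zeta
```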
The action of a 0-cochain $p \in C^0(\mathcal{F})$ on a 0-chain $\phi \in C_0(\mathcal{A})$, given by the pairing $\langle \phi, p \rangle = \sum_a \mathbb{E}_{p_a}[\phi_a]$, calculates the expected value of the Hamiltonian given by a class of interaction potentials with respect to a consistent set of marginals. If a chain $H$ represents a collection of local Hamiltonians, then this pairing doesn't compute an expectation. Instead, if $H = \zeta(\phi)$ and $p$ is a section, we have to take $\langle \mu(H), p \rangle$ to get the expectation. By adjointness, this is $\langle H, \mu^*(p) \rangle$.
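A quick numerical sanity check of this, in the toy model: pairing the interaction potentials against the exact marginals of a Gibbs distribution recovers the expected energy, while pairing the local Hamiltonians against them double-counts the shared potentials.

```python
import numpy as np
rng = np.random.default_rng(1)

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
cover = [A, B, C]
phi = {a: rng.normal(size=(2,) * len(a)) for a in cover}  # random potentials

# Global Hamiltonian on (x0, x1, x2) and its Gibbs distribution.
H = phi[A][:, :, None] + phi[B][None, :, :] + phi[C][None, :, None]
p_global = np.exp(-H); p_global /= p_global.sum()

def marginal(p, region):  # marginalize the global table onto a region
    return p.sum(axis=tuple(i for i in range(3) if i not in region))

p = {a: marginal(p_global, a) for a in cover}  # exact, hence consistent, marginals

pair = lambda f, q: sum((f[a] * q[a]).sum() for a in cover)  # <f, q>
E_H = (H * p_global).sum()
print(np.isclose(pair(phi, p), E_H))  # True: <phi, p> is the expected energy

# The local Hamiltonians zeta(phi), written out by hand for this small cover:
Hloc = {A: phi[A] + phi[C][None, :], B: phi[B] + phi[C][:, None], C: phi[C]}
print(np.isclose(pair(Hloc, p), E_H))  # False: phi_C is counted three times
```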
For reasons I don't understand very deeply, these operations play a key role in formulating message passing dynamics. My current understanding is that it has to do with the gradients of the Bethe entropy. The Bethe entropy is defined on a section $p$ of $\mathcal{F}$ by the same inclusion-exclusion procedure as we used before, but using the Möbius coefficients to perform the inclusion-exclusion: $S_B(p) = \sum_a \mu(S(p))_a = \sum_a c_a S(p_a)$, where $S(p)$ is the 0-chain of local entropies $S(p_a) = -\sum_{x \in E_a} p_a(x) \log p_a(x)$ and $c_a = \sum_{b \supseteq a} \mu_{ba}$. Think of this as computing the local (i.e. marginalized) entropies and then applying the Möbius transform.
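For the toy chain cover the Bethe numbers work out to $c = (1, 1, -1)$, and because that cover is tree-like, the Bethe entropy is not just an approximation there: it equals the global entropy exactly. A self-contained check (Bethe numbers hard-coded; they also drop out of the Möbius recursion above):

```python
import numpy as np
rng = np.random.default_rng(2)

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
c = {A: 1.0, B: 1.0, C: -1.0}  # c_a = sum of Möbius numbers mu_{ba} over b ⊇ a

entropy = lambda q: -(q * np.log(q)).sum()

# A distribution that is Markov along the chain 0 - 1 - 2.
p01 = rng.random((2, 2)); p01 /= p01.sum()            # p(x0, x1)
cond = rng.random((2, 2)); cond /= cond.sum(axis=1, keepdims=True)
p_global = p01[:, :, None] * cond[None, :, :]         # p(x0, x1) p(x2 | x1)

p = {A: p01, B: p_global.sum(axis=0), C: p01.sum(axis=0)}
S_Bethe = sum(c[a] * entropy(p[a]) for a in c)
print(np.isclose(S_Bethe, entropy(p_global)))  # True: exact on a tree
```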
The Bethe free energy is the functional we minimize to find the approximate pseudomarginals: $F_B(p) = \langle \phi, p \rangle - S_B(p)$. Adding the constraint that $p$ be consistent ($\delta p = 0$) and normalized ($\langle \mathbf{1}, p_a \rangle = 1$ for every $a$), we get the Lagrangian condition

$$\nabla_p \left[ \langle \phi, p \rangle - S_B(p) + \langle \lambda, \delta p \rangle + \sum_a n_a \big( \langle \mathbf{1}, p_a \rangle - 1 \big) \right] = 0,$$

which we can then convert (using the adjunction $\langle \lambda, \delta p \rangle = \langle \partial \lambda, p \rangle$, and noting that the gradient of $S_B$, evaluated at consistent $p$, agrees with $-\mu(\log p + \mathbf{1})$ up to something in the image of $\partial$, which is absorbed into the multiplier term) to

$$\phi + \mu(\log p + \mathbf{1}) + \partial \lambda + n = 0.$$

The Lagrange multiplier $n$ is an arbitrary 0-chain which is constant on each vertex, so its zeta transform also satisfies the same condition, and we can roll the constant chain $\mu(\mathbf{1})$ into it. $\lambda$ is an arbitrary 1-chain. Ultimately, applying $\zeta$ to both sides, we have the requirement that

$$p = \exp\big(-\zeta(\phi + \partial \lambda + n)\big).$$
This holds on the level of cochains, and of course the cochain $p$ must also be in $\ker \delta$. So in order to be a critical point for the Bethe free energy functional, $p$ must be obtained as the (exponentiated, negated) zeta transform of something homologous to the given local potential $\phi$, up to some normalizing additive constant. Solving the marginalization problem amounts to a search for a local potential $\phi'$ homologous to the given $\phi$ that makes $p = e^{-\zeta(\phi')}$ a section of $\mathcal{F}$. Note that every section of $\mathcal{F}$ is normalized to some constant sum (marginalization preserves total mass, so consistency forces all the $p_a$ to have the same mass), so the normalization constraint isn't all that interesting. Further, if $p = e^{-\zeta(\phi')}$, it is automatically nonnegative.
One of the key insights now is that if we define a differential equation on $C_0(\mathcal{A})$ where the derivative of $\phi$ is always in the image of $\partial : C_1(\mathcal{A}) \to C_0(\mathcal{A})$, the evolution of the state will be restricted to the homology class of the initial condition. If we can define an operator from 0-chains to 1-chains which vanishes exactly when $p = e^{-\zeta(\phi)}$ is a section of $\mathcal{F}$, we'll be able to use it to implement just such a differential equation.
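The invariance is easy to see concretely: shifting $\phi$ by the boundary of a 1-chain moves potential between regions in a way that telescopes away in the global Hamiltonian. A check in the toy model (conventions mine):

```python
import numpy as np
rng = np.random.default_rng(3)

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
phi = {a: rng.normal(size=(2,) * len(a)) for a in (A, B, C)}

def extend(f, b, a):  # pull a table on region b ⊆ a back to region a
    shape = [2 if v in b else 1 for v in sorted(a)]
    return np.broadcast_to(f.reshape(shape), (2,) * len(a))

def glob(f, region):  # extend a local table to the global space (x0, x1, x2)
    return f.reshape([2 if v in region else 1 for v in range(3)])

global_H = lambda f: sum(glob(f[a], a) for a in f)  # broadcasts to shape (2,2,2)

def boundary(lam):
    """Cosheaf boundary of a 1-chain: the edge (a, b), b ⊂ a, contributes
    +lambda_e at vertex b and -extend(lambda_e) at vertex a."""
    d = {a: np.zeros((2,) * len(a)) for a in (A, B, C)}
    for (a, b), le in lam.items():
        d[b] = d[b] + le
        d[a] = d[a] - extend(le, b, a)
    return d

lam = {(A, C): rng.normal(size=2), (B, C): rng.normal(size=2)}  # a 1-chain
d = boundary(lam)
phi2 = {a: phi[a] + d[a] for a in phi}  # homologous potentials
print(np.allclose(global_H(phi), global_H(phi2)))  # True: same global Hamiltonian
```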
Let $V(\phi) = \partial \Phi(\phi)$, where the flux $\Phi : C_0(\mathcal{A}) \to C_1(\mathcal{A})$ is defined on an edge $a < b$ by

$$\Phi(\phi)_{a \to b} = \log\left( \frac{p_b}{\pi_b\, p_a} \right), \qquad p = e^{-\zeta(\phi)},$$

where $\pi_b$ denotes marginalization onto $E_b$. The operator $V$ clearly vanishes when $p$ is a section of $\mathcal{F}$, since $\Phi$ does. In fact, $\Phi$ vanishes at precisely these points. This operator is somewhat mysterious, but if you squint hard enough it looks kind of like a Laplacian. In fact, its linearization around a given 0-chain appears to be a sheaf Laplacian.
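Here's the flux in the running example (sign conventions mine, chosen so that the flow below converges; Peltre's may differ):

```python
import numpy as np
rng = np.random.default_rng(4)

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
cover, edges = [A, B, C], [(A, C), (B, C)]

def extend(f, b, a):
    shape = [2 if v in b else 1 for v in sorted(a)]
    return np.broadcast_to(f.reshape(shape), (2,) * len(a))

def marginalize(p, a, b):
    axes = tuple(i for i, v in enumerate(sorted(a)) if v not in b)
    return p.sum(axis=axes)

def beliefs(phi):  # p = exp(-zeta(phi)), unnormalized
    return {a: np.exp(-sum(extend(phi[b], b, a) for b in cover if b <= a))
            for a in cover}

def flux(phi):  # Phi(phi)_{a->b} = log(p_b / marginal of p_a onto b)
    p = beliefs(phi)
    return {(a, b): np.log(p[b] / marginalize(p[a], a, b)) for a, b in edges}

phi = {a: rng.normal(size=(2,) * len(a)) for a in cover}
for (a, b), v in flux(phi).items():
    print(sorted(a), '->', sorted(b), np.round(v, 3))  # generically nonzero
```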
Thus the system of differential equations $\dot{\phi} = \partial \Phi(\phi)$ has stationary points corresponding precisely to the critical points of the Bethe free energy. If you discretize this ODE with a naive Euler scheme with step size 1, you get the discrete-time evolution equation $\phi_{t+1} = \phi_t + \partial \Phi(\phi_t)$, and this implies dynamics on the local marginal estimates $p = e^{-\zeta(\phi)}$. It turns out that these dynamics on $p$ are precisely the standard message-passing dynamics, typically described in terms of marginalization, sums, and products. This is a really neat result. And despite the complications involved in defining everything, I find this more comprehensible than the other expositions of generalized message-passing algorithms I've read. (I'm looking at you, Wainwright and Jordan.) I'd still like to understand this better, though. What exactly is the relationship between the operator $\partial \Phi$ and a sheaf Laplacian? Is there a sheaf-theoretic way to understand the max-product algorithm for finding maximum-likelihood elements rather than marginal distributions? Can we use (possibly conically constrained) cohomology of $\mathcal{F}$ to say anything about the pseudomarginal problem?
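Putting the pieces together on the toy chain: the Euler step with step size 1 acts like a parallel belief-propagation update, the flux goes to zero, and since this cover is a tree the converged beliefs are the exact marginals. A sketch (repeating the helpers from the flux example so the block runs on its own):

```python
import numpy as np
rng = np.random.default_rng(5)

A, B, C = frozenset({0, 1}), frozenset({1, 2}), frozenset({1})
cover, edges = [A, B, C], [(A, C), (B, C)]

def extend(f, b, a):
    shape = [2 if v in b else 1 for v in sorted(a)]
    return np.broadcast_to(f.reshape(shape), (2,) * len(a))

def marginalize(p, a, b):
    axes = tuple(i for i, v in enumerate(sorted(a)) if v not in b)
    return p.sum(axis=axes)

def beliefs(phi):  # p = exp(-zeta(phi)), unnormalized
    return {a: np.exp(-sum(extend(phi[b], b, a) for b in cover if b <= a))
            for a in cover}

def flux(phi):  # Phi(phi)_{a->b} = log(p_b / marginal of p_a onto b)
    p = beliefs(phi)
    return {(a, b): np.log(p[b] / marginalize(p[a], a, b)) for a, b in edges}

def euler_step(phi):  # phi <- phi + boundary(flux(phi)): one parallel sweep
    F, new = flux(phi), dict(phi)
    for (a, b), f in F.items():
        new[b] = new[b] + f
        new[a] = new[a] - extend(f, b, a)
    return new

phi = {a: rng.normal(size=(2,) * len(a)) for a in cover}
for _ in range(5):
    phi = euler_step(phi)

print(max(np.abs(v).max() for v in flux(phi).values()) < 1e-8)  # True: converged

# Beliefs at the fixed point match the exact marginals of the Gibbs distribution
# (whose global Hamiltonian is unchanged: phi only moved within its class).
glob = lambda f, r: f.reshape([2 if v in r else 1 for v in range(3)])
p_global = np.exp(-sum(glob(phi[a], a) for a in cover))
p_global = p_global / p_global.sum()
pC = beliefs(phi)[C]; pC = pC / pC.sum()
print(np.allclose(pC, p_global.sum(axis=(0, 2))))  # True: exact marginal on x1
```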