Knowledge graph embeddings are probably not that interesting these days. It was never quite clear to me exactly what they were supposed to accomplish (something to do with automated fuzzy reasoning?), but it seems like most of those use cases have been superseded by LLMs. But when I was working on knowledge graph embeddings for a minute, this was something that bugged me.

The idea behind knowledge graph embeddings is pretty well encapsulated by the whole word2vec story about analogical reasoning. You know the deal, \(\text{king} - \text{man} + \text{woman} \approx \text{queen}\).1 The idea is that the vector \(\text{woman} - \text{man}\) approximately represents some semantic relationship like “is the female equivalent of.” You can make this idea much more explicit, and learn embeddings that are meant to directly encode some set of known relationships between entities. This means you start with some knowledge graph—a set of entities \(\mathcal{E}\), a set of relations \(\mathcal{R}\), and a set \(\mathcal{T}\) of factual triplets \((x, r, y)\), with \(x, y \in \mathcal{E}\) and \(r \in \mathcal{R}\). If \((x, r, y) \in \mathcal{T}\) we know that the relation \(r\) holds between entities \(x\) and \(y\). An equivalent way to frame this, probably a bit more familiar to mathematicians, is to think of \(\mathcal{R}\) as literally a collection of relations on the set \(\mathcal{E}\).
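If you've never tried the analogy trick yourself, it's a one-liner with gensim (a minimal sketch, assuming you're willing to install gensim and download a pretrained model; the model name below is just one of the options in gensim's downloader, not anything specific to this post):

```python
import gensim.downloader

# Download (on first use) and load pretrained word vectors. Any pretrained
# word embeddings will do; GloVe here just because it's small-ish.
vectors = gensim.downloader.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" typically shows up at or near the top -- though see footnote 1
# for why that might be less impressive than it looks.
```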

There are maybe a surprising number of knowledge graphs that you could start from. The Semantic Web was once the future of knowledge. All of the information in the world would be linked together with URIs and RDF triplets! You could have a software agent that would query the World Wide Web for you and find answers to complicated questions!2 Anyway, there’s still a good amount of this kind of information sitting around. Why limit yourself to writing Wikipedia articles when you can convert them into precisely formulated factual statements on Wikidata?

Anyway, viewing a knowledge graph as a set of relations on the entity set naturally raises the question of what properties those relations satisfy. Some of these properties are immediately familiar to anyone who took a sufficiently introductory class on naive set theory:

  • reflexivity
  • (anti)symmetry
  • transitivity
  • (co)injectivity3

A lot of the relations we might want to model in a knowledge graph naturally satisfy some of these properties. This is even potentially useful if we have an incomplete knowledge graph; properties like transitivity can help us infer relations that we don’t have a direct record of (there’s a small sketch of this after the list). For instance:

  • The relation “is a friend of” is (hopefully?) symmetric
  • The relation “is a subordinate of” is antisymmetric
  • The relation “is a type of” is transitive
  • The relation “is an author of” is neither injective (books can have more than one author) nor coinjective (authors can write more than one book)
  • The relation “is married to” is (usually) injective and coinjective (i.e. it defines a bijection on its domain)
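Here’s the promised sketch of that inference trick: if we trust that a relation is transitive, we can take the transitive closure of whatever triplets we do have (the toy graph and relation names are made up for illustration):

```python
# A toy knowledge graph as a set of (head, relation, tail) triplets.
triplets = {
    ("beagle", "is_a_type_of", "dog"),
    ("dog", "is_a_type_of", "mammal"),
    ("mammal", "is_a_type_of", "animal"),
}

def transitive_closure(triplets, relation):
    """Infer every triplet implied by treating `relation` as transitive."""
    edges = {(x, y) for (x, r, y) in triplets if r == relation}
    while True:
        new = {(x, z) for (x, y1) in edges for (y2, z) in edges if y1 == y2}
        if new <= edges:  # nothing left to infer
            return {(x, relation, y) for (x, y) in edges}
        edges |= new

# Recovers e.g. ("beagle", "is_a_type_of", "animal"), which wasn't recorded.
print(transitive_closure(triplets, "is_a_type_of") - triplets)
```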

But it’s actually surprisingly tricky to come up with knowledge graph embedding schemes where the representations are clearly able to satisfy these kinds of properties. Let’s look at TransE, the embedding method that corresponds to the word2vec analogy arithmetic. Entities and relations are all represented as vectors in the same space, and we want the existence of a triplet \((x,r,y)\) to imply that \(v_x + v_r \approx v_y\). Let’s make that \(\approx\) an \(=\) for now and think about what properties a relation encoded this way can exhibit.

  • The only reflexive relation represented in this way is the identity relation.
  • The only symmetric relation represented in this way is also the identity.
  • Every relation represented in this way is strictly antisymmetric (except the identity).
  • The only transitive relation represented in this way is the identity.
  • Every relation represented in this way is injective and coinjective.
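All of these claims fall out of essentially the same two-line computation (with the small caveat that the symmetry and transitivity statements need the relation to be nonempty, and to actually contain a two-step chain, respectively). For symmetry: if \((x,r,y)\) and \((y,r,x)\) both hold, then

\[
v_x + v_r = v_y \quad \text{and} \quad v_y + v_r = v_x \implies 2v_r = 0 \implies v_r = 0,
\]

and \(v_r = 0\) encodes exactly the relation \(v_x = v_y\), i.e. the identity.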

I don’t think most of these constraints are things that can be swept under the rug by allowing the equality to be only approximate. It doesn’t help at all with symmetry or transitivity. You can kind of work around the enforced injectivity by clustering the vectors representing different tail entities \(y'\) around \(v_x + v_r\). This works when you only have one relation to care about. But say we want to encode “\(x\) is an author of \(y\)” for a bunch of books, and also “\(y\) has a topic of \(z\)” for each of these books. Now every book this author writes has to be about the same topic (or cluster of topics)!

TransH is another method that was designed to solve the problem that TransE can only represent bijective relations. The idea is to choose a subspace \(V_r\) for each relation \(r\) and project onto \(V_r\) before checking the approximate equality: \(P_{V_r} (v_x + v_r) \approx P_{V_r} v_y\). This is a lot nicer, and has a natural way of dealing with one-to-many and many-to-many relations. But it’s still limited in weird ways. If you want a reflexive, transitive, or symmetric relation, you need \(P_{V_r} v_r = 0\), so any relation satisfying one of these properties satisfies all three and is in fact an equivalence relation (entities are related exactly when they have the same projection onto \(V_r\)).
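Here’s a quick numerical illustration of why the projection buys you one-to-many relations (my paraphrase with a generic projection matrix, not the exact parameterization from the TransH paper):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# An orthogonal projector onto a random 4-dimensional subspace V_r.
basis = np.linalg.qr(rng.normal(size=(dim, 4)))[0]  # orthonormal basis of V_r
P = basis @ basis.T
v_r = rng.normal(size=dim)

def violation(v_x, v_y):
    # How badly (x, r, y) violates the translation after projecting onto V_r.
    return np.linalg.norm(P @ (v_x + v_r) - P @ v_y)

# Two *different* tail embeddings both satisfy the relation perfectly,
# because they differ only in directions that P kills.
v_x = rng.normal(size=dim)
v_y1 = v_x + v_r + (np.eye(dim) - P) @ rng.normal(size=dim)
v_y2 = v_x + v_r + (np.eye(dim) - P) @ rng.normal(size=dim)
print(violation(v_x, v_y1), violation(v_x, v_y2))  # both ≈ 0, yet v_y1 != v_y2
```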

The main reason I’m interested in thinking about knowledge graph representations these days is to better understand how LLMs encode relational knowledge. LLMs know facts like “Michael Jordan plays basketball”4, and can do precisely the sort of compositional reasoning with these facts that knowledge graph embeddings are supposed to enable. They can even do this with facts provided only in context and not in their training data. To do this, they’re working with fairly straightforward operations on vector representations of entities. There’s been a bit of work on trying to figure out what these representations look like, mostly using approaches that represent relations with individual matrices. The idea for these is that either \(A_r v_x \approx v_y\) or \(\langle A_r v_x, v_y \rangle \gg 0\) for true relations.
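Concretely, fitting such a matrix boils down to least squares over pairs of entity embeddings known to stand in the relation. A sketch with synthetic data standing in for actual model activations (none of the specifics here come from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_pairs = 16, 200

# Stand-ins for entity embeddings: row i of X and row i of Y are the
# embeddings of a head x and a tail y with (x, r, y) known to hold.
X = rng.normal(size=(n_pairs, dim))
A_true = rng.normal(size=(dim, dim))
Y = X @ A_true.T + 0.01 * rng.normal(size=(n_pairs, dim))

# Fit A_r so that A_r v_x ≈ v_y, i.e. minimize ||X A_r^T - Y||_F.
A_r = np.linalg.lstsq(X, Y, rcond=None)[0].T

def functional_score(v_x, v_y):
    return -np.linalg.norm(A_r @ v_x - v_y)  # the "A_r v_x ≈ v_y" version

def bilinear_score(v_x, v_y):
    return (A_r @ v_x) @ v_y  # the "<A_r v_x, v_y> >> 0" version
```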

The first approach has many of the same issues as TransE, since it’s still trying to represent a possibly one-to-many relation with a function. Matrix multiplication is significantly more expressive than vector addition, though—in particular it’s not always injective, so it can natively represent many-to-one relationships. But transitivity is still tricky. If we want \(A_r v_x \approx v_y\) and \(A_r v_y \approx v_z\) to mean \(A_r v_x \approx v_z\), we need \(A_r\) to be (approximately) idempotent, at least on a subspace that contains the embeddings of all relevant entities. But that means that it’s a projection matrix, and so everything except the first entity in a transitive chain of relations needs to have the same embedding. You run into similar issues if you want to represent reflexive relations. There are more things you can do to increase expressivity—my paper on this took things about as general as they could get while keeping everything linear—but the more expressive it gets, the easier it is to overfit as well.
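Spelled out, the collapse happens because an idempotent \(A_r\) fixes everything in its image:

\[
v_z \approx A_r v_y \approx A_r^2 v_x \approx A_r v_x \approx v_y,
\]

so after the first step of the chain, every subsequent entity gets (approximately) the same embedding.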

One lesson to take from this whole thing is that actually, linear representations kinda suck if you want to encode a large universe of concepts simultaneously. And this is why language models don’t operate linearly on static encodings of concepts. Internal token representations are contextual; it’s entirely possible that the existence of a relation between embeddings of two entities is only linearly detectable when the context implies that it may be important to extract that relation. For instance, when a model is prompted with “What is the capital of France?”, the embedding of “France” is different than when it’s prompted with “What is the GDP of France?”, and it may be that the representation of the “capital of” relation only works with the first. This throws a bit of a wrench in the idea that we might be able to easily decode all of the relations an LLM is aware of from its internal activations in a single context. So there’s definitely a lot more to be done to understand an LLM’s knowledge graph (whatever that actually means). I’m just not sure how much studying knowledge graph embeddings will actually help.

  1. There’s some doubt about whether word2vec was actually learning anything nearly this sophisticated. It turns out that \(\text{man}\) and \(\text{woman}\) were already pretty close, as were \(\text{king}\) and \(\text{queen}\), so maybe all this equation was picking up on is that \(\text{king} \approx \text{queen}\). Is this maybe an early interpretability illusion?

  2. Man, I miss the Internet of the early 2010s. The promise and vision of the future was so much cooler than any future we were plausibly going to end up with. 

  3. OK, maybe “coinjective” isn’t a term most people learn in an introduction to proofs course. But it’s useful, just meaning that the converse relation is injective. 

  4. This specific example is weirdly common in the literature on the topic of what LLMs know and how they know it.