Transformers Viewed as Generalizations of GCNs

Frank Odom
The DL

--

I show that Graph Convolutional Networks are (almost) a special case of Transformers, and discuss how that affects our interpretation of each.

Introduction

Transformers and Graph Convolutional Networks (GCNs) have grown tremendously in popularity over the past few years. At first glance, these two model types appear to have little in common. They were designed to solve very different problems:

  • Transformers improved on attention-based models in NLP.
  • GCNs operate on unstructured data types, such as 3D point clouds.

GCNs have received a small amount of attention in NLP, but otherwise, the two models are used for different applications. In this article, I will argue that they are actually closely related. In fact, GCNs can be viewed as a variant of Transformer Encoders with some small modifications. I’ll discuss how this affects our interpretation of each, and propose new applications for Transformers as a result.

Note: This is a high-level discussion, and the short code sketches below are optional illustrations. You don’t need to know how to code Transformers or GCNs to understand this article. But if you’re interested, see my previous articles on Transformers from Scratch in PyTorch and Graph Convolutional Networks in 2 Minutes.

Photo by Dan Gold on Unsplash

Transformer Encoder

Let’s start by examining the Transformer Encoder architecture. It consists of N layers, each of which contains Multi-Head Attention and Feed Forward layers:

Source: Author
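
For concreteness, here is a rough PyTorch sketch of a single encoder layer. The layer sizes and the use of nn.MultiheadAttention are my own illustrative choices, not a reference implementation:

    import torch
    from torch import nn

    class EncoderLayer(nn.Module):
        """One Transformer encoder layer: self-attention followed by a
        feed-forward block, each with a residual connection and layer norm."""

        def __init__(self, d_model: int = 64, num_heads: int = 4, d_ff: int = 256):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.feed_forward = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Self-attention: queries, keys, and values all come from x.
            attn_out, _ = self.attn(x, x, x)
            x = self.norm1(x + attn_out)              # residual + normalization
            x = self.norm2(x + self.feed_forward(x))  # residual + normalization
            return x

    # The full encoder is just N of these layers applied in sequence.
    encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
    out = encoder(torch.randn(8, 16, 64))  # (batch_size, seq_length, num_features)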

But instead of Multi-Head Attention, I’ll use a simpler Single-Head Attention layer. This is equivalent to setting num_heads to one, so single-head attention is strictly a special case of the multi-head layer. The attention module can now be represented as

Source: Author

where Q, K, and V all have shape (batch_size, seq_length, num_features).
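
As a minimal sketch (assuming the usual scaling by the square root of the feature dimension), single-head attention amounts to just a few lines:

    import torch
    import torch.nn.functional as F

    def single_head_attention(q, k, v):
        """Scaled dot-product attention for one head.
        q, k, v each have shape (batch_size, seq_length, num_features)."""
        scale = q.size(-1) ** 0.5
        scores = q @ k.transpose(-2, -1) / scale  # (batch, seq_length, seq_length)
        weights = F.softmax(scores, dim=-1)       # each row sums to 1
        return weights @ v                        # (batch, seq_length, num_features)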

Graph Convolutional Network

Many variations of GCN exist, but for this discussion, I’ll use the “Vanilla GCN” architecture as first described in Spectral Networks and Deep Locally Connected Networks on Graphs. Similar to the Transformer above, it consists of N identical layers:

Source: Author

Spectral Convolution is the defining feature of GCN models. This layer multiplies the input vectors by a predefined adjacency matrix. In the context of graph data, the adjacency matrix specifies which nodes in the graph are connected by an edge. Multiplication by the adjacency matrix effectively propagates information between neighboring nodes in the graph:

Source: Author.
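
As a minimal sketch (ignoring the adjacency-matrix normalization that most GCN variants apply), one such layer might look like this:

    import torch
    from torch import nn

    class SpectralConv(nn.Module):
        """One 'vanilla' GCN layer: propagate features along graph edges by
        multiplying with the adjacency matrix, then apply a learned linear map."""

        def __init__(self, in_features: int, out_features: int):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features)

        def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # x:   (batch_size, num_nodes, in_features)
            # adj: (batch_size, num_nodes, num_nodes), fixed for each input graph
            return torch.relu(self.linear(adj @ x))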

Simplifying the Transformer

We hope to show that GCNs are a special case of Transformers. To do that, I will incrementally simplify components of the Transformer above. A simplification is allowed only if the original component can exactly reproduce its behavior by fixing certain weights and biases.

Simplification #1

The identity transformation can be exactly reproduced with a linear layer. We just set the weights to an identity matrix, and the biases to zero. Therefore, the identity transformation is an allowed simplification for linear layers.

Source: Author
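
A quick PyTorch sanity check of that claim:

    import torch
    from torch import nn

    d = 8
    linear = nn.Linear(d, d)
    with torch.no_grad():
        linear.weight.copy_(torch.eye(d))  # weights = identity matrix
        linear.bias.zero_()                # biases = zero

    x = torch.randn(2, 5, d)
    assert torch.allclose(linear(x), x)    # the layer reproduces its input exactly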

Using this identity, we can dramatically simplify the Single-Head Attention module, so that it is equivalent to Scaled Dot-Product Attention.

Source: Author

Simplification #2

When the magnitude of Layer(x) is very large, we can approximate (x + Layer(x)) ≈ Layer(x). Normalization just rescales the values after computing the residual. So, the behavior of non-residual layers can be approximated arbitrarily well by residual layers whenever the layer outputs are very large. Therefore, non-residual layers are an allowed simplification for residual layers.

Source: Author
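
A toy numerical example of the approximation (the magnitudes here are made up):

    import torch

    x = torch.tensor([1.0, -2.0, 0.5])
    layer_out = torch.tensor([4000.0, -2500.0, 3000.0])  # pretend Layer(x) is very large
    residual = x + layer_out
    # The relative difference shrinks as the layer output grows in magnitude.
    print(torch.allclose(residual, layer_out, rtol=1e-2))  # True: x is negligible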

With this, we can simplify the Transformer architecture again:

Source: Author

Simplification #3

Feed Forward consists of multiple Linear layers and nonlinear activations applied in series. A single Linear layer followed by an activation is clearly a special case of this: just fix the remaining Linear layers to the identity, as in Simplification #1. Therefore, (Linear + Activation) is an allowed simplification of Feed Forward.

Source: Author
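
Again, a quick check in code: fixing the second Linear layer to the identity collapses the Feed Forward block to a single Linear layer plus activation.

    import torch
    from torch import nn

    d = 8
    feed_forward = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    # Fix the second Linear layer to the identity transformation.
    with torch.no_grad():
        feed_forward[2].weight.copy_(torch.eye(d))
        feed_forward[2].bias.zero_()

    x = torch.randn(2, 5, d)
    simplified = torch.relu(feed_forward[0](x))        # Linear + Activation
    assert torch.allclose(feed_forward(x), simplified)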

After applying this simplification, our Transformer and GCN diagrams look extremely similar:

Source: Author

Attention vs Spectral Convolution

If we could show Spectral Convolution to be a simplification of Dot-Product Attention, then GCNs would constitute a strict subset of all Transformer models! Unfortunately, that isn’t possible, because Spectral Convolution uses a predetermined adjacency matrix, while Dot-Product Attention computes everything directly from the real-valued input array. In general, we can’t force the attention computed from arbitrary, real-valued inputs to reproduce an arbitrary, predetermined adjacency matrix.

But there are still similarities between the two, and we can gain additional insight by analyzing them in closer detail. Let’s start with Spectral Convolution, which is illustrated again in the figure below.

Source: Author

Each element in the adjacency matrix (A[:, i, j]) is nonzero if nodes i, j are connected by an edge in the graph. This gives it the following properties (a small concrete example follows the list):

  • Static — It is predetermined for each set of inputs.
  • Square — It has shape (batch_size, num_nodes, num_nodes), since it multiplies the input vector with shape (batch_size, num_nodes, num_features).
  • Sparse — Only some of its elements are nonzero.
  • Symmetric — Edges are bidirectional, so A[:, i, j] == A[:, j, i].
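
Here is a tiny made-up example of an adjacency matrix with all four of the properties above:

    import torch

    # A 4-node graph with edges (0, 1), (1, 2), (2, 3).
    edges = [(0, 1), (1, 2), (2, 3)]
    adj = torch.zeros(4, 4)
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0   # symmetric: edges are bidirectional
    adj = adj.unsqueeze(0)            # (batch_size, num_nodes, num_nodes)
    # adj is static (fixed ahead of time), square, sparse, and symmetric.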

Dot-Product Attention doesn’t use an adjacency matrix, so how can we compare it with Spectral Convolutions? Notice that one of the intermediate arrays (circled in red below) is similar to the adjacency matrix:

Source: Author

I’ll refer to this as the attention matrix. It roughly describes which elements we should “pay attention” to in the input sequence. It has the following properties, in comparison with the adjacency matrix:

  • Dynamic — Nothing is predetermined, because it is computed directly from the input vector.
  • Square — Matrix multiplying the two input arrays of shape (batch_size, seq_length, num_features) results in a square attention matrix with shape (batch_size, seq_length, seq_length).
  • Approximately Sparse — The softmax activation forces many elements in each row to be close to zero. But they’re not exactly zero, since they are all floating point values.
  • Non-Symmetric — Softmax operates along each row separately, and the result is not symmetric.

We’ve approximately reproduced 2 out of 4 properties of the adjacency matrix! With a small change, we can actually make it symmetric as well — just swap the softmax activation for any point-wise activation function. (Recall that after Simplification #1, Q, K, and V are all equal to the input, so the product QKᵀ is already symmetric; only the row-wise softmax breaks that symmetry.) Depending on our choice of activation, we can also preserve approximate sparsity of the attention matrix! (For example: relu and its variants, and arguably sigmoid.) In that case, the only remaining difference is that the attention matrix is computed dynamically, whereas the adjacency matrix is provided statically.
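
Putting those tweaks together, a hypothetical “graph-flavored” attention layer might look like this (my own sketch of the idea, not a standard layer):

    import torch

    def symmetric_attention(x):
        """Attention with the simplifications discussed above: no learned
        projections (Q = K = V = x) and a point-wise ReLU instead of the
        row-wise softmax. The resulting attention matrix is symmetric and
        approximately sparse, like an adjacency matrix, but still computed
        dynamically from the input."""
        scale = x.size(-1) ** 0.5
        scores = x @ x.transpose(-2, -1) / scale  # symmetric, since (X Xᵀ)ᵀ = X Xᵀ
        weights = torch.relu(scores)              # point-wise activation, many zeros
        return weights @ x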

Interpretations

So GCN isn’t quite a special case of Transformers, but there are some very strong similarities between them. What can this teach us about each of the respective model types? My takeaways are listed below, but I expect there are additional insights that I’m overlooking. If you have thoughts to add, please drop a comment at the end of this article! I’ll gradually add those into the article over time.

  1. Does this change our understanding of Transformers or GCNs, respectively?
    Probably not in any significant way. It may prove helpful as an alternate interpretation of attention heads in Transformers. But it doesn’t change anything about the way these models are derived, implemented, or even trained. This is mostly just an interesting anecdote, and (hopefully) a fun exercise in understanding the two types of models.
  2. Should we be surprised that GCNs have shown potential on certain NLP tasks?
    Absolutely not. Transformers represent the state of the art for essentially every NLP task, and we’ve just shown that GCNs are closely related to them.
  3. Should we be surprised that Transformers are taking over the computer vision space?
    I don’t think so. Spectral Convolution is commonly presented as a generalization of image convolution, which allows it to process unstructured graph data. (It’s not quite a true generalization, much like Transformers are not a true generalization of GCNs. But they are closely related conceptually.) So if Transformers are related to GCNs, and GCNs are related to image convolutions, then there must be a connection between Transformers and Convolutional Neural Networks. That may be a distant connection, but it’s a connection nonetheless.
  4. Are there any other tasks that Transformers should be applied to, given their relationship to GCNs?
    Definitely. I’m interested to see more Transformers applied to graph-structured data. Transformers will likely require more training data, because they have more trainable parameters and miss out on some pre-existing knowledge from the adjacency matrix. We’ve seen the same effect recently in computer vision — Transformers require more training than CNNs, but produce state-of-the-art performance in the end.
Photo by Matt Noble on Unsplash

Conclusion

I hope this was an interesting high-level exploration of Transformers, GCNs, and the similarities between them. If you’d like to learn more about how each of them is implemented, see my previous articles:

  • Transformers from Scratch in PyTorch
  • Graph Convolutional Networks in 2 Minutes

Questions, comments, and constructive criticism are greatly appreciated — just leave me a comment on this article. If you enjoyed the article, follow me and The DL to get future articles and weekly updates on the latest in deep learning news.
