Self Attention

Explained with the Q, K, V notation

Hussain Safwan
6 min read · Jul 19, 2021

Let's kick off with a little scenario: The earth is like an orange. Orange is a bright color. Now put a little focus on the word orange in each sentence. It should be pretty obvious that the very same word is conveying meanings that are wildly different from each other. In the first sentence it denotes orange the fruit, while the second one is talking about orange the color. But how are we able to tell these apart? After all, it is essentially the same word. That's where context comes into play.

What is attention?

Context plays a major role in our natural conversations. The same word or phrase may assume multiple meanings depending on the context it is used in. Context is the set of other words around a word that provide the circumstances for the event being described. To grasp the complete and intended meaning of a word, or better a token, as it is referred to in NLP, we need to look at the other words that surround it.

Attention is the cognitive mechanism of focusing on one token, or a handful of tokens, believed to be of more significance in expressing the intended meaning of the token in question.

Take the example cited above, The earth is like an orange: to determine whether orange here refers to the color or the fruit, we look at the word earth, and we would very much want our algorithm to do the same.

In the context of NLP, attention is the process of attaching a bit of context to each of the tokens being processed. Though it had already appeared in several forms in time-series and sequence models, the attention mechanism was greatly popularized by Google's 2017 paper on the Transformer model, entitled Attention Is All You Need. The Transformer uses the mechanism in a couple of different settings, but what we'll be looking at here is self-attention.

Self Attention

Technically speaking, self-attention is the relative degree of attention each token should pay to the fellow tokens of the sentence.

It can be thought of as a table that lists each token along both the rows and the columns, where the (i, j)-th cell holds the relative degree of attention the i-th token should pay to the j-th token. Now the question is: how do we fill that table in?
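To make the table concrete, here is a tiny, hand-made sketch in Python for our working sentence; the numbers are invented purely for illustration, not produced by any model:

```python
# One row of a hypothetical attention table for "The earth is like an orange".
# The row for "orange" leans heavily on "earth", and each row sums to 1.
tokens = ["The", "earth", "is", "like", "an", "orange"]
attention_row_for_orange = {
    "The": 0.05, "earth": 0.55, "is": 0.05,
    "like": 0.10, "an": 0.05, "orange": 0.20,
}
```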

Let's take the previous sentence, The earth is like an orange, as a working example.

1. We tokenize the sequence and translate each token into a word embedding. Let the sequence now be denoted as,

V = [V1, V2, …, Vn]; Vi denoting the word vector for the i-th word

2. Okay, all set for the first operation. We take each word embedding and assign the others a relative priority with respect to how similar they are. Now, how exactly do we measure the similarity between words? That is exactly why we need the embeddings. This idea of measuring similarity very much deserves its own article, but I'll put it in as few words as possible while keeping it clear.

You see, word embeddings are vectors, usually with elements in the range -1 to 1 (negative roughly meaning "is not") and of a fixed length depending on which implementation you're up against. Each slot denotes a feature. For example, the embedding for the word earth may have a large value in the slot for round (like 0.95 or so, since it's almost a sphere) and large negative values for features like rectangular or toxic environment (since the earth is neither rectangular nor toxic to life). On the other hand, orange might also have a large positive number for the feature round, since it's roundish too. Now see what happens when we dot-product the embeddings. Since they have large values in similar slots and possibly zero or negligible numbers (either positive or negative) in the others, the dot product generates a rather large result. On the contrary, had they not been similar, they would have had larger values in dissimilar slots that cancel each other out in the dot product and produce a tiny value. So that's pretty much how dot products are used to find similarities between word embeddings (the code sketch after these steps shows the dot product in action).

3. Say we begin with the token earth. We dot-product the vector for earth with each of the vectors in the sequence, yes, with itself too, and receive a plain number for each: a score denoting how similar/related they are.

4. Once the similarity scores are retrieved, they're normalized to push them into a fixed range (in the Transformer this is done with a softmax, so the scores for each token sum to 1). Now they can formally be termed weights.

5. In this step the weights are multiplied with the corresponding vector embeddings in the sequence and the results are summed up, producing a vector specific to the word we began with, in this case the word earth. Each weight tells us how related the word earth is to that particular token, so the resulting vector is a context-aware blend of the words earth attends to most. We repeat the same procedure for every token in the sequence, and the weight vectors essentially stack up into a matrix (like the table mentioned above).
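Putting the five steps together, here is a minimal sketch in plain NumPy. The embeddings are random stand-ins for real word vectors and softmax plays the role of the normalization step; nothing here is learned yet, which is exactly the catch discussed below.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())            # subtract the max for numerical stability
    return e / e.sum()

# Step 1: stand-in embeddings for "The earth is like an orange"
tokens = ["The", "earth", "is", "like", "an", "orange"]
d_model = 8
rng = np.random.default_rng(0)
V = rng.uniform(-1, 1, size=(len(tokens), d_model))   # one row per token

attention = np.zeros((len(tokens), len(tokens)))
outputs = np.zeros_like(V)
for i, _ in enumerate(tokens):
    scores = V @ V[i]                  # steps 2-3: dot-product similarity with every token
    weights = softmax(scores)          # step 4: normalize the scores into weights
    attention[i] = weights
    outputs[i] = weights @ V           # step 5: weighted blend of all the embeddings

print(np.round(attention, 2))          # the table of relative attention
```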

Here’s a pictorial view of the schematic,

This depicts the iteration for the word orange.

All good so far. But there's a catch in all of this. We pass in a word and find the weight matrix for relative attention; that's fine. But where's the learning in all this? Being an ML model, it should strive to learn something with every passing iteration, right? Let's deal with that with a database analogy. We feed the algorithm a word and ask it for the relative attention, and the algorithm returns it. It's as if the algorithm acts as a database that, when queried with a word (orange in the diagram) and provided with a set of keys (the original sequence in the diagram), maps the query against the keys and returns the corresponding values (also the original sequence, as shown above). That's where the notions of Query, Keys and Values drop in.
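A loose sketch of that analogy (not the actual Transformer computation, just the lookup idea): a database or dictionary does a hard lookup, returning the one value whose key matches the query exactly, whereas attention does a soft lookup, returning a blend of all the values, each weighted by how well its key matches the query.

```python
import numpy as np

# Hard lookup: exact key match, a single value comes back.
database = {"orange": [0.1, 0.9], "earth": [0.8, 0.2]}
value = database["orange"]

# Soft lookup: every stored value contributes, weighted by key-query similarity.
keys = np.array([[0.1, 0.9], [0.8, 0.2]])       # one row per stored key
values = np.array([[1.0, 0.0], [0.0, 1.0]])     # one row per stored value
query = np.array([0.2, 0.8])                    # what we are asking about

scores = keys @ query
weights = np.exp(scores) / np.exp(scores).sum() # softmax-style normalization
soft_value = weights @ values                   # a weighted mix of all the values
```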

Again, we have to perform the same operation for each word, so why not put up a matrix representation of the Queries, Keys and Values? Hence we multiply each of the components with the appropriate Q, K and V weight matrices. Note that this also solves the problem of learning.

How? Well, during backpropagation the back-flowing gradients update the entries of these matrices in a way that minimizes the overall error. The new schematic is shown below,

X, as in X K, denotes being dot-producted with the matrix K

Now let's put the entire iterative flow within one package.

The X vector on the left denotes the input word embeddings. Note that X1 and X2 are vectors. This matrix is passed to the Value, Query and Key layers via matrix multiplication. The outputs of the Query and Key layers, generated via that multiplication, are passed through the normalization layer. The normalized outputs, termed weights, are then applied to the value vectors to generate the final attention outputs. One more thing to note here is the color convention used: the arrows in green denote the forward pass, red denotes the backward pass, where the parameters are updated, that is, where learning occurs, and those that do not identify with either are in blue.
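Here is a minimal sketch of that flow as a single self-attention layer in PyTorch. The layer sizes are arbitrary, the division by sqrt(d_k) is the scaling the Transformer paper adds for stability, and the dummy loss at the end exists only to show the backward pass that updates the Q, K and V matrices.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        # The learnable Query, Key and Value matrices: backprop updates their entries.
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_k, bias=False)

    def forward(self, X):                           # X: (seq_len, d_model) word embeddings
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        scores = Q @ K.T / math.sqrt(K.shape[-1])   # every query scored against every key
        weights = torch.softmax(scores, dim=-1)     # the normalization layer
        return weights @ V, weights                 # context vectors and the attention table

# Forward and backward pass on a toy 6-token sequence of 8-dimensional embeddings
attn = SelfAttention(d_model=8, d_k=8)
X = torch.randn(6, 8)
out, weights = attn(X)
loss = out.sum()        # dummy loss, purely to demonstrate the backward pass
loss.backward()         # gradients now sit on W_q, W_k, W_v, ready for an optimizer step
```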
