
NumPy explained - Advanced indexing

Level up your NumPy array data manipulation

Let's start with an inspiring example: what is the output of the following code?

import numpy as np

x = np.arange(5 * 6 * 7 * 8).reshape(5, 6, 7, 8)
ind1 = np.array([[1,1], [2,2]])
ind2 = np.array([[1,2], [1,2]])

print(x[ind1, ind2, :, :].shape)
print(x[:, ind1, ind2, :].shape)
print(x[:, :, ind1, ind2].shape)
print(x[ind1, :, ind2, :].shape)
print(x[ind1, :, :, ind2].shape)

Answer:

>>> print(x[ind1, ind2, :, :].shape)
(2, 2, 7, 8)
>>> print(x[:, ind1, ind2, :].shape)
(5, 2, 2, 8)
>>> print(x[:, :, ind1, ind2].shape)
(5, 6, 2, 2)
>>> print(x[ind1, :, ind2, :].shape)
(2, 2, 6, 8)
>>> print(x[ind1, :, :, ind2].shape)
(2, 2, 6, 7)

This example is pretty advanced - we are indexing with multi-dimensional arrays (does anyone actually do that?). So let's recap the basics.

Simple case scenario

[figure: indexing a multidimensional array with 1D index arrays]

In the simplest form, each indexing array is a 1-dimensional array. What is interesting is that if each indexing array has shape (N), then the array resulting from indexing all dimensions of the original array (as in the example) is also an array of shape (N).

As we have seen in our introductory example, not all dimensions of the original array have to be indexed. On some dimensions we can choose all the elements - we then put : instead of an indexing array in those dimensions.
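Here is a minimal sketch of both cases (the array and index values are made up for illustration):

b = np.arange(4 * 5 * 6).reshape(4, 5, 6)
ind1 = np.array([0, 1, 3])  # shape (3)
ind2 = np.array([2, 2, 4])  # shape (3)
ind3 = np.array([5, 0, 1])  # shape (3)

# indexing every dimension: the result has the shape of the indexing arrays
b[ind1, ind2, ind3].shape   # (3,)
# b[ind1, ind2, ind3][i] == b[ind1[i], ind2[i], ind3[i]]

# leaving the last dimension untouched with ":"
b[ind1, ind2, :].shape      # (3, 6)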

[figure: examples of indexing only some dimensions, with : on the others]

This works quite intuitively. The way to think of this visually is that the dimensions we index disappear and are replaced with the dimensions of the indexing arrays.

Notice the third example. Here we have:

arr[:,:,[2,3],[2,3]].shape=(4,5,2)

Why not (4,5,2,2)? Advanced indexing behaves differently from slicing in this respect. We would get (4,5,2,2) if we picked all possible combinations of [2,3] on dimension 3 and [2,3] on dimension 4.

But instead we do: pick 2 elements:

  • the first element takes the first value from [2,3] on dimension 3 and the first value from [2,3] on dimension 4
  • the second element takes the second value from [2,3] on dimension 3 and the second value from [2,3] on dimension 4
  • the third element would take the third values, and so on (if the indexing arrays were longer)

So there are only len([2,3]) combinations (the common length of the indexing arrays), not len(indexing_array_1) * len(indexing_array_2) * len(indexing_array_3) * ... combinations. The way advanced indexing works is visualized on the left. On the right there is a visualization of how taking all possible combinations works.

[figure: matched advanced indexing (left) vs. taking all possible combinations (right)]

This is unintuitive because slicing behaves differently: it takes all the possible combinations, so arr[:, :, 2:4, 2:4].shape=(4,5,2,2).
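A small sketch comparing the two behaviours (the array shape here is assumed for illustration):

arr = np.arange(4 * 5 * 6 * 7).reshape(4, 5, 6, 7)

# matched pairs: element k uses [2,3][k] on dimension 3 and [2,3][k] on dimension 4
arr[:, :, [2, 3], [2, 3]].shape    # (4, 5, 2)

# slicing: every combination of rows 2:4 and columns 2:4
arr[:, :, 2:4, 2:4].shape          # (4, 5, 2, 2)

# all combinations with advanced indexing, by broadcasting a column vector
# against a row vector (broadcasting of indexing arrays is covered later in the post)
arr[:, :, np.array([[2], [3]]), np.array([[2, 3]])].shape   # (4, 5, 2, 2)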

Adding more dimensions

Up to now we were only indexing with 1D arrays. But it is also possible to index the array's dimensions with multidimensional arrays. It works exactly the same way as in the one-dimensional case. The only way the shape of the indexing arrays affects the output is through the shape of the resulting array.

If, for example, we index an array of shape (5,6,7) with 3 arrays of shape (2,3):

x = np.arange(5 * 6 * 7).reshape(5, 6, 7) # shape (5,6,7)
ind1 = np.array([[1,1,1], [2,2,2]]) # shape (2,3)
ind2, ind3 = ind1, ind1

x_indexed = x[ind1, ind2, ind3]
x_indexed.shape
>>> (2,3)

What happens is that the indices are still matched across the indexing arrays, but their shape stays intact.

x_indexed[0, 0] = x [ ind1[0,0], ind2[0,0], ind3[0,0] ]
x_indexed[1, 0] = x [ ind1[1,0], ind2[1,0], ind3[1,0] ]
x_indexed[0, 2] = x [ ind1[0,2], ind2[0,2], ind3[0,2] ]
# and so on...

In general we can say that the equation for indexing is:

x_result = x[ind1, ind2, ind3, ...]
x_result[i,j,k,...] = x [ind1[i,j,k,...], ind2[i,j,k,...], ind3[i,j,k,...], ...]

This has the same property we had with 1D indexing - the resulting array has the same shape as the indexing arrays!
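We can sanity-check this rule directly (a minimal sketch, reusing the x, ind1, ind2, ind3 defined above):

x_result = x[ind1, ind2, ind3]
for i in range(2):
    for j in range(3):
        # every output element comes from the matched triple of indices
        assert x_result[i, j] == x[ind1[i, j], ind2[i, j], ind3[i, j]]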

So what happens when we combine the partial indexing with multidimensional arrays?

x = np.arange(5 * 6 * 7 * 8).reshape(5,6,7,8)
ind1 = np.array([[1,1], [2,2]])
ind2 = np.array([[1,2], [1,2]])
x[:, :, ind1, ind2].shape
>>> (5,6,2,2)

So the original dimensions (7,8) were indexed with arrays of shape (2,2), and the resulting shape was inserted at the exact place where (7,8) had been. The number of indexed dimensions doesn't matter - we can, for example, index three dimensions with 3 arrays of shape (2,2) and all 3 of those dimensions will be replaced with (2,2).
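For instance, a quick sketch reusing ind1 and ind2 from above, plus a third index array assumed here:

ind3 = np.array([[0, 3], [4, 5]])  # shape (2,2), values chosen arbitrarily
# dimensions (6,7,8) are indexed and replaced by the (2,2) index shape
x[:, ind1, ind2, ind3].shape       # (5, 2, 2)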

This leads to a problem…

What should happen if the indexed dimensions are not a contiguous subsequence of the shape? Meaning, what would happen if we had x[ind1, :, ind2]?

[figure: indexing non-contiguous dimensions, as in x[ind1, :, ind2]]

Here we index the dimensions (5,7) with 1D arrays of shape (2). What should be the shape of the result? (It's not (2,2) - if you don't see why, go back to the earlier sections; that's just how this indexing works.) We have two options: the left indexed dimension is replaced with 2, or the right one is. The authors of this design could have invented something like "we place the result on the left-most indexed dimension". It would be a valid design, but what about multidimensional indexing arrays? If we replace two dimensions with 3 dimensions, where should they go?

To resolve this, there is an "if" statement in the integer array indexing rules stating:

  • if the indexed dimensions are not contiguous, then the dimensions coming from the indexing arrays are put at the beginning of the result

This means that in a situation like the one at the beginning:

x.shape = (5,6,7,8)
ind1.shape = (2,4)
ind2.shape = (2,4)
x[ind1, ind2, :, :] -> (2,4,7,8)
x[:, ind1, ind2, :] -> (5,2,4,8)
x[ind1, :, ind2, :] -> (2,4,6,8)

Other tricks

There are a few more tricks. If the indexing arrays are of different shapes, they are broadcast together. This enables tricks like the following, to get the 4 corners of a (256,256) image:

img = np.random.randn(256,256)
x = np.array([[0,255]]) # shape = (1,2)
y = np.array([[0], [255]]) # shape = (2,1)
# broadcasted(x,y) -> (2,2)
img[x,y].shape
>>> (2,2)

Other tricks:

  • Instead of writing a lot of ":" we can write an ellipsis: a[:, :, :, [5,2]] is equal to a[..., [5,2]] (see the sketch after this list)
  • Advanced integer array indexing always returns a copy of the data. This is different from slicing, where the subarray is a view of the original one.
  • You can write a[None, :, :], where None (an alias of np.newaxis) means adding a new dimension of size 1
  • Indexing with a plain (non-tuple) sequence of index arrays, like a[[[2,3], [4,5]]], used to be interpreted the same way as a tuple of indices, but this behaviour is deprecated in recent NumPy versions - prefer passing a tuple
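Here is a minimal sketch of the first three tricks (the array is made up for illustration):

a = np.arange(3 * 4 * 5 * 6).reshape(3, 4, 5, 6)

# ellipsis stands for "all the remaining dimensions"
assert a[..., [5, 2]].shape == a[:, :, :, [5, 2]].shape == (3, 4, 5, 2)

# None / np.newaxis inserts a new dimension of size 1
assert a[None, :, :].shape == (1, 3, 4, 5, 6)

# advanced indexing copies, slicing gives a view
copied = a[[0, 1]]         # integer array index -> copy
viewed = a[0:2]            # slice -> view
copied[0, 0, 0, 0] = -1    # does not change a
viewed[0, 0, 0, 0] = -1    # changes a
assert a[0, 0, 0, 0] == -1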

Actual use case

So now that we know how advanced integer array indexing works in numpy, let's use this knowledge to solve a task that I came across in a homework assignment from a Natural Language Processing class.

Task description

Write a benchmark for a Large Language Model. The way LLMs work is that they predict the next token. We can say that they output a distribution across all possible tokens - let's say there are 50 000 tokens. The benchmark is a set of questions with ABCD answers; for example, we feed the model the whole question-choices sequence:

Question: Glucose is transported into the muscle cell:

Choices:
A. via protein transporters called GLUT4.
B. only in the presence of insulin.
C. via hexokinase.
D. via monocarbylic acid transporters.

Correct answer: A

We score the model on the whole answer sequence. Let's say that we want to calculate the probability of the answer via monocarbylic acid transporters. There are a few ways to do that. The simplest one is to check whether the model outputs "D". A better method is to check the probability that, given the question-choices sequence, the next token is "D". But in our scenario we check the probability of the whole answer sentence: we multiply the next-token probabilities over the whole sequence. In pseudocode:

p(via monocarbylic acid transporters) = p(via) * p(monocarbylic | via) * p(acid | via monocarbylic) * p(transporters | via monocarbylic acid)

The next-token prediction distribution has shape (50_000) - we are interested in the probability of the one token that actually appears in the answer. The output of the model has shape (batch_dim, sequence_length, no_tokens).

That is, for each element of the batch we input a sequence like via monocarbylic acid transporters, and for each token in this sentence we get the predicted distribution across tokens (in reality the predictions are shifted by one, so [via][mono][carbylic] predicts [mono][carbylic][acid], but let's ignore that for now).

Let's assume we are given the tokenized answer answer_tokenized with answer_tokenized.shape=(sequence_len), where answer_tokenized[i] is the ID of the i-th token.

answer_tokenized.shape=(sequence_len)

The output from the model is

predictions.shape=(sequence_len,no_tokens)

For example, if the ID of the token Hi is 30, then predictions[0,30] = 20% means that the model assigns a 20% probability to Hi being the first token.

We want the result:

probabilities.shape=(sequence_len)

Let’s use advanced indexing!

We want to index the no_tokens dimension - the 2nd dimension of the predictions. We would like to index it with answer_tokenized. But there is a problem. If we did:

predictions[:, answer_tokenized] #shape would be (sequence_len, sequence_len)

We wouldn't pair the answer_tokenized positions with the prediction positions! For every position in the predictions we would take every token from answer_tokenized.

To fix that we can add one artificial indexing array.

sequence_indexing = np.arange(sequence_len) # shape (sequence_len)
answer_probabilities = predictions[sequence_indexing, answer_tokenized] # shape (sequence_len)

Now every position in the prediction sequence gets exactly one token probability, just like we wanted.
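As a quick sanity check (a minimal sketch with made-up shapes), the indexed result matches an explicit loop:

sequence_len, no_tokens = 4, 10
predictions = np.random.rand(sequence_len, no_tokens)
answer_tokenized = np.random.randint(0, no_tokens, size=sequence_len)

sequence_indexing = np.arange(sequence_len)
answer_probabilities = predictions[sequence_indexing, answer_tokenized]

# the same thing written as an explicit loop
for i in range(sequence_len):
    assert answer_probabilities[i] == predictions[i, answer_tokenized[i]]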

To make things more realistic, let's add a batch dimension. Now:

batch_dim = 100
sequence_len = 60
no_tokens = 50_000
answer_tokenized = np.random.randint(0, 50_000, size=(batch_dim, sequence_len))
predictions = np.random.rand(batch_dim, sequence_len, no_tokens)

Now we have to match not only the position in the sequence but also the index of the batch: the first sample from the batch goes with the first answer sequence, the second sample with the second sequence, and so on. We can do:

batch_idx = np.arange(0, batch_dim).reshape((batch_dim, 1))
sequence_idx = np.arange(0, sequence_len).reshape((1, sequence_len))
answer_probabilities = predictions[batch_idx, sequence_idx, answer_tokenized] # shape is (batch_dim, sequence_len)

For every i in batch_idx and for every j in sequence_idx we get answer_probabilities[i,j] = predictions[i, j, answer_tokenized[i,j]]. Broadcasting the (batch_dim, 1) and (1, sequence_len) index arrays against each other is how we simulate the "every possible combination" approach within advanced indexing.
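Again, a small sanity check (sketch) that the broadcast indexing does what the nested loops would do:

for i in range(batch_dim):
    for j in range(sequence_len):
        assert answer_probabilities[i, j] == predictions[i, j, answer_tokenized[i, j]]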

There are handy shortcut functions for this scenario

This pattern is so common that there exist functions that do exactly that.

For example, in PyTorch there exists the torch.gather function which, according to the docs:

torch.gather(input, dim, index) Gathers values along an axis specified by dim.

For a 3-D tensor the output is specified by:

out[i][j][k] = input[index[i][j][k]][j][k]  # if dim == 0
out[i][j][k] = input[i][index[i][j][k]][k]  # if dim == 1
out[i][j][k] = input[i][j][index[i][j][k]]  # if dim == 2

input and index must have the same number of dimensions.
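In our benchmark scenario, a minimal sketch (assuming the same shapes as above, with torch tensors in place of NumPy arrays) would use dim=2, the token dimension:

import torch

predictions_t = torch.rand(batch_dim, sequence_len, no_tokens)
answer_tokenized_t = torch.randint(0, no_tokens, (batch_dim, sequence_len))

# index must have the same number of dimensions as input, hence unsqueeze(-1)
answer_probabilities_t = torch.gather(
    predictions_t, 2, answer_tokenized_t.unsqueeze(-1)
).squeeze(-1)  # shape (batch_dim, sequence_len)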

NumPy also has its counterpart, numpy.take_along_axis, which is related to numpy.take. The latter does the simpler thing: a call such as np.take(arr, indices, axis=3) is equivalent to arr[:,:,:,indices,...].

The take_along_axis:

numpy.take_along_axis(arr, indices, axis) Take values from the input array by matching 1d index and data slices.

This iterates over matching 1d slices oriented along the specified axis in the index and data arrays, and uses the former to look up values in the latter. These slices can be different lengths.

Functions returning an index along an axis, like argsort and argpartition, produce suitable indices for this function.

Meaning that instead of writing artificial batch_idx and sequence_idx we can write:

answer_probabilities = np.take_along_axis(predictions, answer_tokenized[..., np.newaxis], 2).squeeze(-1)

With answer_tokenized[..., np.newaxis] we are adding one empty dimension to match the number of dimensions: answer_tokenized[..., np.newaxis].shape=(batch_size, sequence_length, 1) versus predictions.shape=(batch_size, sequence_length, no_tokens). In some ways it is even easier to understand: if the indices argument had shape (batch_size, sequence_length, 3), then for each batch element and each sequence position we would take 3 tokens. It only becomes a problem when we need fancy indexing along two dimensions at once - then we can always fall back to the np.arange() indexing arrays with broadcasting.
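For example, a sketch of grabbing the 3 most probable tokens at every position, using argsort to produce the indices (as the docs quoted above suggest):

# indices of the 3 highest-probability tokens at every (batch, position) pair
top3_idx = np.argsort(predictions, axis=2)[..., -3:]           # (batch_dim, sequence_len, 3)
top3_prob = np.take_along_axis(predictions, top3_idx, axis=2)  # (batch_dim, sequence_len, 3)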

The conclusion is that although advanced indexing is pretty complicated and unintuitive at first glance, it allows us to perform a very wide range of tasks.
