Starting with an inspiring example: what is the output of the following code?
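Something along these lines (the exact array and index values here are only illustrative):

```python
import numpy as np

a = np.arange(4 * 5 * 6 * 7).reshape(4, 5, 6, 7)
i = np.array([[0, 1],
              [2, 3]])
j = np.array([[1, 1],
              [2, 2]])

print(a[i, :, j, :].shape)
```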
Answer:
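For the sketch above, the printed shape is:

```python
(2, 2, 5, 7)
```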
The example here is pretty advanced: we are indexing with multi-dimensional arrays (does anyone actually do that?). So let's recap the basics.
Simple case scenario
In the simplest form, each indexing array is just a 1-dimensional array. What is interesting is that if each indexing array is of shape (N), then the array resulting from indexing all dimensions of the original array (like in the examples below) is also an array of shape (N).
As we have seen in our introductory example, not all dimensions of the original array have to be indexed. On some dimensions we can choose all the elements - we then put : instead of an indexing array in those dimensions.
This works quite intuitively. The way to think of it visually is that the dimensions we index disappear and are replaced with the dimensions of the indexing arrays.
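A few illustrative cases, using an arbitrary (4, 5, 6, 7) array (the index values below are made up):

```python
arr = np.arange(4 * 5 * 6 * 7).reshape(4, 5, 6, 7)

# 1) all four dimensions indexed with 1D arrays of length 3 -> shape (3,)
arr[[0, 1, 2], [0, 1, 2], [2, 3, 4], [2, 3, 4]].shape   # (3,)

# 2) only dimensions 1 and 2 indexed -> they are replaced by the index length
arr[:, [0, 1], [2, 3], :].shape                          # (4, 2, 7)

# 3) only the last two dimensions indexed
arr[:, :, [2, 3], [2, 3]].shape                          # (4, 5, 2)
```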
Notice the third example. Here we have:
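(with the same arr as above)

```python
arr[:, :, [2, 3], [2, 3]].shape   # (4, 5, 2)
```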
Why not (4,5,2,2)? Advanced indexing behaves differently from slicing in this respect. We would get (4,5,2,2) if, for each indexing array, we picked all possible combinations of [2,3] on dimension 3 and [2,3] on dimension 4.
But instead we pick 2 elements:
- the first element takes the first value from [2,3] on dimension 3 and the first value from [2,3] on dimension 4
- the second element takes the second value from [2,3] on dimension 3 and the second value from [2,3] on dimension 4
- the third element would take the third values, and so on
So there are only len(some_indexing_array) (here len([2,3]) = 2) combinations, not len(indexing_array_1) * len(indexing_array_2) * len(indexing_array_3) * ... combinations.
The way advanced indexing works is visualized on the left. On the right there is a visualization of how taking all possible combinations would work.
This is unintuitive because slicing behaves differently - it does take all the possible combinations: arr[:, :, 2:4, 2:4].shape=(4,5,2,2).
Adding more dimensions
Up to now we were only indexing with 1D arrays. But it is also possible to index the array's dimensions with multidimensional arrays. The way it works is exactly the same as in the single-dimension case. The only way the indexing arrays' shape affects the output is through the shape of the resulting array.
If, for example, we index an array a.shape=(5,6,7) with 3 arrays of shape (2,3):
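A sketch of what that could look like (index values chosen arbitrarily):

```python
a = np.arange(5 * 6 * 7).reshape(5, 6, 7)

i = np.array([[0, 1, 2],
              [3, 4, 0]])   # shape (2, 3), values < 5
j = np.array([[0, 1, 2],
              [3, 4, 5]])   # shape (2, 3), values < 6
k = np.array([[0, 2, 4],
              [6, 1, 3]])   # shape (2, 3), values < 7

a[i, j, k].shape             # (2, 3)
```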
What happens is that the indices are still matched across the indexing arrays, but their shape stays intact.
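For example, the element at position (1, 2) of the result comes from the (1, 2) entries of all three indexing arrays:

```python
result = a[i, j, k]
result[1, 2] == a[i[1, 2], j[1, 2], k[1, 2]]   # True - indices are matched position-wise
```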
In general we can say that the equation for indexing is:
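Following the NumPy reference (ind_1, ..., ind_N are the indexing arrays, all of the same shape):

```python
result[x_1, ..., x_M] == a[ind_1[x_1, ..., x_M], ind_2[x_1, ..., x_M],
                           ..., ind_N[x_1, ..., x_M]]
```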
This has the same property we had with 1D indexing - the resulting array has the same shape as the indexing arrays!
So what happens when we combine the partial indexing with multidimensional arrays?
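For example (shapes chosen to match the description below):

```python
a = np.arange(5 * 6 * 7 * 8).reshape(5, 6, 7, 8)

i = np.array([[0, 1],
              [2, 3]])      # shape (2, 2), indexes the dimension of size 7
j = np.array([[0, 1],
              [2, 3]])      # shape (2, 2), indexes the dimension of size 8

a[:, :, i, j].shape          # (5, 6, 2, 2)
```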
So the original dimensions (7,8) were indexed with arrays of shape (2,2), and the resulting shape (2,2) was inserted at the exact place where the (7,8) had been. The number of dimensions doesn't matter - we could, for example, index with 3 arrays of shape (2,2), and all 3 indexed dimensions would be replaced with (2,2).
This leads to a problem…
What should happen if the indexed dimensions are not a contiguous subsequence of the shape? Meaning, what would happen if we had x[ind1, :, ind2]?
Here we index the dimensions (5,7) with 1D arrays of shape (2). What should be the shape of the result? (It's not (2,2) - if you don't see why, go back to the earlier sections; that's just how the indexing works.) We have two options: the left indexed dimension gets replaced with 2, or the right one does. The authors of this design could have invented a rule like "we place it on the left-most indexed dimension". That would be a valid design, but what about multidimensional indexing arrays? If we replace two dimensions with 3 dimensions, where should they go?
To fix this issue there is an "if" statement in the integer array indexing rules stating:
- if the indexed dimensions are not contiguous, then the indexing arrays' dimensions are put at the beginning
This means that if we have a situation like the one at the beginning:
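(recreating the opening sketch from above)

```python
a = np.arange(4 * 5 * 6 * 7).reshape(4, 5, 6, 7)
i = np.array([[0, 1], [2, 3]])
j = np.array([[1, 1], [2, 2]])

# dimensions 0 and 2 are indexed, dimension 1 is skipped -> not contiguous,
# so the (2, 2) index shape goes to the front
a[i, :, j, :].shape          # (2, 2, 5, 7)
```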
Other tricks
There are a few more tricks. If the indexing arrays are of different shapes, they are broadcast together. This enables tricks like the following one, to get the 4 corners of a (256,256) image:
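One way to write it (a sketch; the broadcasting does the work):

```python
img = np.arange(256 * 256).reshape(256, 256)

rows = np.array([[0], [255]])   # shape (2, 1)
cols = np.array([0, 255])       # shape (2,)  -> broadcast together to (2, 2)

corners = img[rows, cols]        # shape (2, 2)
# corners[0, 0] == img[0, 0]      corners[0, 1] == img[0, 255]
# corners[1, 0] == img[255, 0]    corners[1, 1] == img[255, 255]
```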
A few more:
- Instead of writing a lot of ":" we can write an ellipsis: a[:, :, :, [5,2]] is equal to a[..., [5,2]]
- All arrays are copied when using advanced integer array indexing. This is different from slicing, where the subarray is a view into the original one.
- You can write a[None, :, :], where None means inserting an empty dimension of size 1
- You can index the whole array with any iterable of arrays, like a[[[2,3], [4,5]]] - it doesn't have to be a tuple
Actual use case
So now that we know how advanced integer array indexing works in NumPy, let's use this knowledge to solve a task that I came across in a homework from a Natural Language Processing class.
Task description
Write a benchmark for a Large Language Model. The way LLMs work is that they predict the next token: they output a distribution across all possible tokens - let's say there are 50 000 of them. The benchmark is a set of questions with ABCD answers; for example, we input to the model the whole question-choices sequence:
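The exact question doesn't matter here; as a made-up template, the input could look like:

```
<question text>
A) <choice A>
B) <choice B>
C) <choice C>
D) via monocarbylic acid transporters
Answer:
```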
We score the model on the whole answer sequence. Let's say that we want to calculate the probability of the answer via monocarbylic acid transporters. There are a few ways to do that. The simplest one is to check whether the model outputs "D". A better method is to check the probability that, given the question-choices sequence, the next token is "D". But in our scenario we check the probability of the whole answer sentence: we multiply the next-token probabilities over the whole sequence. In pseudocode:
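A sketch of that scoring loop (names are illustrative; model(...) stands for one forward pass returning the next-token distribution):

```python
answer_probability = 1.0
for i, token_id in enumerate(answer_tokens):
    # distribution over the ~50 000 tokens, given the question and the answer so far
    next_token_distribution = model(question_tokens + answer_tokens[:i])
    answer_probability *= next_token_distribution[token_id]
```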
The next-token prediction distribution has the shape (50_000) - we are interested in the probability of the one token that actually appears in the answer.
The output of the model is of shape (batch_dim, sequence_length, no_tokens). That is, for each element of the batch we input a sequence like via monocarbylic acid transporters, and for each token in this sentence we get the predicted distribution across tokens (in reality the predictions are shifted by one, so [via][mono][carbylic] predicts [mono][carbylic][acid], but let's ignore that for now).
Let's assume we are given the tokenized answer answer_tokenized.shape=(sequence_length), where answer_tokenized[i] is the ID of the i-th token.
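For example (token IDs made up):

```python
answer_tokenized = np.array([4178, 1517, 89, 402])   # shape (sequence_length,), here (4,)
```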
The output from the model is:
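For now, ignore the batch dimension:

```python
predictions.shape == (sequence_length, no_tokens)   # e.g. (4, 50_000)
# predictions[i, t] is the probability the model assigns to token t at position i
```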
For example, if the ID of the token Hi is 30, then predictions[0, 30] = 20% means that the probability that the first token is Hi is 20%.
We want the results:
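that is, one probability per answer position:

```python
answer_probabilities.shape == (sequence_length,)
# where answer_probabilities[i] == predictions[i, answer_tokenized[i]]
```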
Let’s use advanced indexing!
We want to index the no_tokens dimension - the 2nd dimension of predictions.
We would like to index it with answer_tokenized. But there is a problem. If we simply did:
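(the naive attempt)

```python
predictions[:, answer_tokenized].shape   # (sequence_length, sequence_length)
```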
We wouldn't pair the answer_tokenized positions with the predictions positions! For every position in predictions we would get the probabilities of all the tokens from answer_tokenized.
To fix that we can add one artificial indexing array.
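A minimal sketch of the fix:

```python
sequence_idx = np.arange(sequence_length)            # [0, 1, ..., sequence_length - 1]

answer_probabilities = predictions[sequence_idx, answer_tokenized]
answer_probabilities.shape                           # (sequence_length,)
```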
Now every position in predictions gets only one token probability, just like we wanted.
To make things more realistic, let's add a batch dimension.
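Now the shapes become:

```python
predictions.shape      == (batch_size, sequence_length, no_tokens)
answer_tokenized.shape == (batch_size, sequence_length)
```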
Now we have to match not only the place in the sequence but also the index of the batch: the first sample from the batch goes with the first answer sequence, the second sample with the second sequence, and so on. We can do:
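One way to build the two artificial indexing arrays:

```python
batch_idx    = np.arange(batch_size)[:, np.newaxis]        # shape (batch_size, 1)
sequence_idx = np.arange(sequence_length)[np.newaxis, :]   # shape (1, sequence_length)

# the three indexing arrays broadcast to (batch_size, sequence_length)
answer_probabilities = predictions[batch_idx, sequence_idx, answer_tokenized]
answer_probabilities.shape                                  # (batch_size, sequence_length)
```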
For every i in batch_idx and for every j in sequence_idx we get answer_probabilities[i,j] = predictions[i, j, answer_tokenized[i,j]]. This way, thanks to broadcasting, we can simulate the "every possible combination" behavior within advanced indexing.
There is a handy shortcut function for this scenario
This pattern is so common that there exist functions that do exactly that.
For example, in PyTorch there is the torch.gather function, which according to the docs:
torch.gather(input, dim, index) Gathers values along an axis specified by dim.
For a 3-D tensor the output is specified by:
out[i][j][k] = input[index[i][j][k]][j][k]  # if dim == 0
out[i][j][k] = input[i][index[i][j][k]][k]  # if dim == 1
out[i][j][k] = input[i][j][index[i][j][k]]  # if dim == 2
input and index must have the same number of dimensions.
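In our case that would look roughly like this (a sketch, assuming predictions and answer_tokenized are torch tensors):

```python
import torch

# index must have the same number of dimensions as input, so add a trailing axis
answer_probabilities = torch.gather(
    predictions,                            # (batch_size, sequence_length, no_tokens)
    dim=2,
    index=answer_tokenized.unsqueeze(-1),   # (batch_size, sequence_length, 1)
).squeeze(-1)                               # (batch_size, sequence_length)
```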
NumPy also has its counterpart, numpy.take_along_axis, which is a specialized version of numpy.take. numpy.take does: "A call such as np.take(arr, indices, axis=3) is equivalent to arr[:,:,:,indices,...]."
And take_along_axis itself:
numpy.take_along_axis(arr, indices, axis) Take values from the input array by matching 1d index and data slices.
This iterates over matching 1d slices oriented along the specified axis in the index and data arrays, and uses the former to look up values in the latter. These slices can be different lengths.
Functions returning an index along an axis, like argsort and argpartition, produce suitable indices for this function.
Meaning that instead of writing the artificial batch_idx and sequence_idx arrays, we can write:
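(a sketch; the trailing axis of size 1 is dropped at the end)

```python
answer_probabilities = np.take_along_axis(
    predictions,                          # (batch_size, sequence_length, no_tokens)
    answer_tokenized[..., np.newaxis],    # (batch_size, sequence_length, 1)
    axis=2,
)[..., 0]                                 # back to (batch_size, sequence_length)
```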
With answer_tokenized[..., np.newaxis] we are adding one empty dimension to match the number of dimensions: answer_tokenized[..., np.newaxis].shape=(batch_size, sequence_length, 1) versus predictions.shape=(batch_size, sequence_length, no_tokens).
In some ways it is even easier to understand: if the indices argument had shape (batch_size, sequence_length, 3), then for each batch element and each sequence position we would take 3 tokens. But when the indices don't fit this one-axis pattern, for example when they are genuinely 2-dimensional, we can always fall back to np.arange() indexing arrays with broadcasting.
The conclusion is that although advanced indexing is pretty complicated and unintuitive at first glance, it allows us to perform a very wide range of tasks.