We've also just now achieved much better clustering results with a slight change in our data visualization approach:
This approach is designed to ignore as much spurious variation in the embeddings as possible and focus only on the variation that is relevant for the output, yielding crisp, clean clusters.
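One way to get this kind of effect, as a minimal sketch and not necessarily the method behind the plots: project each embedding onto the subspace spanned by the unembedding vectors of the candidate answer tokens, then run PCA inside that subspace. The names `hidden`, `unembed_rows`, and `output_relevant_projection` are hypothetical, and choosing the unembedding directions as the "output-relevant" ones is an assumption.

```python
import numpy as np

def output_relevant_projection(hidden, unembed_rows, n_components=2):
    """Project embeddings onto output-relevant directions, then reduce to 2D.

    hidden:       (n, d) array of embeddings at one layer (hypothetical input).
    unembed_rows: (k, d) array of unembedding vectors for the candidate
                  answer tokens, with k <= d (an assumption of this sketch).
    """
    # Orthonormal basis for the span of the chosen unembedding vectors.
    q, _ = np.linalg.qr(unembed_rows.T)          # shape (d, k)
    # Coordinates inside that subspace; anything orthogonal to it is dropped.
    coords = hidden @ q                          # shape (n, k)
    coords = coords - coords.mean(axis=0)
    # PCA via SVD on the centered coordinates.
    _, _, vt = np.linalg.svd(coords, full_matrices=False)
    return coords @ vt[:n_components].T          # shape (n, n_components)
```

Variation orthogonal to the chosen directions cannot affect the resulting plot, no matter how large it is.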
Interestingly, layers 29-30 seem to be the most clustered, and layer 32 is not as cleanly separated! We don't quite know the reason for this, but we speculate that it may be due to the influence of unrelated continuations (e.g. x + y = ____, x + y = ?).
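To put a number on "most clustered" when comparing layers, one option is a simple ratio of between-cluster to within-cluster scatter. This is only a sketch under assumptions: the function name `separation_score` is hypothetical, and this metric is not necessarily how the layers above were compared.

```python
import numpy as np

def separation_score(points, labels):
    """Between-cluster scatter divided by within-cluster scatter.

    Higher values mean tighter, better separated clusters.
    Assumes at least one cluster has some internal spread
    (otherwise the denominator is zero).
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    overall_mean = points.mean(axis=0)
    between = 0.0
    within = 0.0
    for lab in np.unique(labels):
        group = points[labels == lab]
        centroid = group.mean(axis=0)
        # Scatter of this cluster's centroid around the global mean.
        between += len(group) * np.sum((centroid - overall_mean) ** 2)
        # Scatter of this cluster's points around its own centroid.
        within += np.sum((group - centroid) ** 2)
    return between / within
```

Computing this score on the projected points of each layer, with the expected output as the label, would give one curve over layers to check against the visual impression.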