Beyond Training Context Length
Transformer-based language models have a context length limit during training. Let's denote this length as `L`. Each training example is a chunk of `L` tokens sampled from the training dataset. As you can imagine, the larger `L` is, the more compute training requires (self-attention cost grows quadratically with sequence length) and the slower training becomes. But at inference time, it would be great if the model's context length could exceed the training context length...
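To make the sampling step concrete, here is a minimal sketch (the function name `sample_training_examples` is hypothetical and not tied to any particular framework) of how a fixed training context length `L` shapes data preparation: each training example is just a contiguous run of `L` tokens taken from the tokenized corpus.

```python
import numpy as np

def sample_training_examples(token_ids, L, batch_size, rng=None):
    """Draw `batch_size` contiguous chunks of `L` tokens from a token stream.

    Hypothetical helper for illustration: real training pipelines differ,
    but the key constraint is the same -- every example has length `L`.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Pick random start offsets, then slice out L tokens from each.
    starts = rng.integers(0, len(token_ids) - L, size=batch_size)
    return np.stack([token_ids[s : s + L] for s in starts])

# Toy usage: a fake token stream and a training context length of L = 8.
tokens = np.arange(1000)
batch = sample_training_examples(tokens, L=8, batch_size=4)
print(batch.shape)  # (4, 8)
```

Under this setup the model never sees a sequence longer than `L` during training, which is exactly why getting it to handle longer contexts at inference time is non-trivial.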
Date: January 31, 2025