Large Language Models
Gemma 4 Introduces Multi-Token Prediction Drafters for Faster Inference
May 5, 2026
AI Summary
Gemma 4 has introduced Multi-Token Prediction (MTP) drafters, speeding up inference by as much as three times with no loss in output quality. The change targets the latency of autoregressive decoding in large language models, making Gemma 4 more practical for developers across a range of applications.
- Gemma 4 models have surpassed 60 million downloads shortly after their release.
- The MTP drafters use a speculative decoding architecture: a lightweight drafter proposes tokens that the main model then verifies.
- Because the drafter predicts several tokens per step, decoding needs far fewer sequential forward passes through the large model.
- Standard large language models generate text one token at a time, so decoding latency grows linearly with output length.
- The drafter shares activations and KV cache with the target model, keeping its proposals cheap to compute and closely aligned with the target's predictions.
- Developers can achieve faster inference for applications like coding assistants and mobile apps by pairing Gemma 4 with MTP drafters.
- The MTP drafters are available under an open-source license, with resources for implementation provided online.
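The draft-then-verify loop described above can be sketched in a few lines. The "models" below are toy deterministic stand-ins, not Gemma 4, and all function names (`target_next`, `draft_k_tokens`, `speculative_decode`) are illustrative; the point is only the control flow: the drafter proposes a block of tokens, the target checks them, and everything up to the first mismatch is accepted in one step.

```python
# Minimal sketch of speculative decoding with a multi-token drafter.
# Toy stand-in "models": deterministic next-token rules over integer tokens.

def target_next(context):
    # Toy target model: next token is the context sum mod 10.
    return sum(context) % 10

def draft_k_tokens(context, k):
    # Toy drafter: proposes k tokens at once. Here it mimics a
    # well-trained MTP drafter by agreeing with the target rule.
    ctx = list(context)
    out = []
    for _ in range(k):
        tok = sum(ctx) % 10
        out.append(tok)
        ctx.append(tok)
    return out

def speculative_decode(context, n_tokens, k=4):
    """Generate n_tokens via draft-then-verify speculative decoding."""
    ctx = list(context)
    generated = []
    while len(generated) < n_tokens:
        draft = draft_k_tokens(ctx, k)
        accepted = []
        for tok in draft:
            expected = target_next(ctx + accepted)
            if tok == expected:
                accepted.append(tok)      # drafter agreed: keep it
            else:
                accepted.append(expected) # first mismatch: take the
                break                     # target's token and stop
        ctx.extend(accepted)
        generated.extend(accepted)
    return generated[:n_tokens]
```

The verification step is what preserves output quality: every accepted token is exactly what the target model would have produced one token at a time, so the result matches plain greedy decoding while amortizing many tokens per round.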
gemma 4 · multi-token prediction · inference · ai tools · developers