
Combining Next-Token Prediction and Video Diffusion in Computer Vision and Robotics | MIT News

In the current AI zeitgeist, sequence models are becoming increasingly popular due to their ability to analyze data and predict what to do next. For example, you’ve probably used next-token prediction models like ChatGPT, which anticipate each word (token) in a sequence to form responses to user queries. There are also full-sequence diffusion models like Sora, which transform words into dazzling, realistic visuals by successively “denoising” an entire video sequence.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have proposed a simple change to the diffusion training scheme that makes this sequence denoising significantly more flexible.

When applied to fields such as computer vision and robotics, next-token and full-sequence diffusion models come with performance tradeoffs. Next-token models can spit out sequences of varying lengths. However, they produce these generations without being aware of desirable states in the far future – such as steering their sequence generation toward a specific goal 10 tokens away – and therefore require additional mechanisms for long-horizon planning. Diffusion models can perform such future-conditioned sampling, but they lack the ability of next-token models to produce sequences of variable length.

Researchers at CSAIL wanted to combine the strengths of both models, so they developed a sequence-model training technique called “diffusion forcing.” The name comes from “teacher forcing,” the conventional training scheme that breaks full-sequence generation down into the smaller, easier steps of next-token generation (much like a good teacher simplifies a complex concept).

Diffusion forcing builds on a similarity between diffusion models and teacher forcing: both use training schemes in which masked (noisy) tokens are predicted from unmasked ones. Diffusion models gradually add noise to data, which can be viewed as fractional masking. The MIT researchers’ diffusion forcing method trains neural networks to cleanse a collection of tokens, removing different amounts of noise from each one while simultaneously predicting the next few tokens. The result: a flexible, reliable sequence model that led to higher-quality synthetic videos and more precise decision-making for robots and AI agents.
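To make the training idea concrete, here is a minimal PyTorch sketch of one loss computation. The `denoiser` network, the linear noise schedule, and every name in it are illustrative assumptions rather than the authors’ implementation: each token is corrupted with its own independently sampled noise level, and the model learns to denoise every position given the partially noised sequence.

```python
import torch

def diffusion_forcing_loss(denoiser, clean_tokens, num_noise_levels=1000):
    """Illustrative training step: every token gets its own noise level
    (fractional masking), and the network is asked to denoise all of them."""
    batch, seq_len, dim = clean_tokens.shape

    # Sample an independent noise level per token.
    levels = torch.randint(0, num_noise_levels, (batch, seq_len),
                           device=clean_tokens.device)
    alpha = 1.0 - levels.float() / num_noise_levels   # 1 = clean, 0 = pure noise
    alpha = alpha.unsqueeze(-1)                       # broadcast over features

    # Corrupt each token according to its own noise level.
    noise = torch.randn_like(clean_tokens)
    noisy_tokens = alpha.sqrt() * clean_tokens + (1 - alpha).sqrt() * noise

    # A causal sequence model predicts the noise at every position,
    # conditioned on the noisy history and each token's noise level.
    predicted_noise = denoiser(noisy_tokens, levels)

    return torch.nn.functional.mse_loss(predicted_noise, noise)
```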

By sorting through noisy data and reliably predicting the next steps of a task, diffusion forcing can help a robot ignore visual distractions and perform manipulation tasks. It can also produce stable and consistent video sequences and even guide an AI agent through digital mazes. This method could potentially allow household and factory robots to take on new tasks and improve AI-generated entertainment.

“Sequence models aim to condition on the known past and predict the unknown future, a type of binary masking. However, the masking does not have to be binary,” says lead author Boyuan Chen, an MIT PhD student in Electrical Engineering and Computer Science (EECS) and CSAIL member. “With diffusion forcing, we add different levels of noise to each token, effectively acting as a type of fractional masking. At test time, our system can ‘unmask’ a collection of tokens and diffuse a sequence in the near future at a lower noise level. It knows what to trust in its data to overcome out-of-distribution inputs.”
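One way to picture that test-time behavior is to assign low noise levels to tokens in the near future and higher levels to tokens further out, then let the model denoise them all together. The sketch below is schematic and reuses the hypothetical `denoiser` interface from the training sketch above; a real sampler would iterate over many denoising steps rather than making a single call.

```python
import torch

def sample_with_uncertainty_horizon(denoiser, history, horizon,
                                    num_noise_levels=1000, dim=None):
    """Schematic sampling pass: near-future tokens start at low noise,
    distant ones at high noise, so the model commits to the near term
    while keeping the far future flexible."""
    batch, past_len, feat = history.shape
    feat = dim or feat

    # Noise level grows with distance into the future.
    levels = torch.linspace(0.1, 0.9, horizon) * num_noise_levels
    levels = levels.long().unsqueeze(0).expand(batch, -1)

    # Known past tokens are kept clean (noise level 0).
    past_levels = torch.zeros(batch, past_len, dtype=torch.long)
    future = torch.randn(batch, horizon, feat)

    tokens = torch.cat([history, future], dim=1)
    all_levels = torch.cat([past_levels, levels], dim=1)

    # The model "unmasks" every token at its own noise level.
    return denoiser(tokens, all_levels)
```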

In several experiments, diffusion forcing managed to ignore misleading data to perform tasks while anticipating future actions.

For example, when implemented in a robotic arm, it helped swap two toy fruits across three circular mats, a minimal example of a family of long-horizon tasks that require memory. The researchers trained the robot by controlling (or teleoperating) it remotely in virtual reality. The robot then learned to mimic the user’s movements from its camera. Even though it started from random positions and saw distractions like a grocery bag blocking the markers, it placed the objects at their target spots.

To create videos, they trained diffusion forcing on “Minecraft” gameplay and colorful digital environments built in Google’s DeepMind Lab simulator. From a single frame of footage, the method produced more stable, higher-resolution videos than comparable baselines, such as a Sora-like full-sequence diffusion model and ChatGPT-like next-token models. Those approaches produced videos that appeared inconsistent, with the latter sometimes failing to generate working videos past just 72 frames.

In addition to generating fancy videos, diffusion forcing can also serve as a motion planner that steers toward desired outcomes or rewards. Thanks to its flexibility, diffusion forcing can uniquely generate plans with varying horizons, perform tree search, and incorporate the intuition that the distant future is more uncertain than the near future. When tasked with solving a 2D maze, diffusion forcing outperformed six baselines by generating faster plans leading to the goal location, suggesting it could be an effective planner for robots in the future.

In each demo, diffusion forcing acted as a full-sequence model, a next-token prediction model, or both. According to Chen, this versatile approach could potentially serve as a powerful backbone for a “world model,” an AI system that can simulate the dynamics of the world by training on billions of internet videos. This would allow robots to perform novel tasks by imagining what they need to do based on their surroundings. For example, if you asked a robot to open a door without it having been trained to do so, the model could produce a video showing the machine how to do it.

The team is currently trying to extend their method to larger data sets and the latest transformer models to improve performance. They plan to expand their work to develop a ChatGPT-like robot brain that helps robots perform tasks in new environments without human intervention.

“With diffusion forcing, we are taking a step toward bringing video generation and robotics closer together,” says senior author Vincent Sitzmann, an assistant professor at MIT and a member of CSAIL, where he leads the Scene Representation Group. “Ultimately, we hope we can use all the knowledge stored in videos on the internet to enable robots to help in everyday life. Many more exciting research challenges remain, such as how robots can learn to imitate humans by watching them, even when their own bodies are so different from ours!”

Chen and Sitzmann co-authored the paper with recent MIT visiting researcher Diego Martí Monsó and CSAIL members Yilun Du, an EECS graduate student; Max Simchowitz, a former postdoc and incoming assistant professor at Carnegie Mellon University; and Russ Tedrake, the Toyota Professor of EECS, Aeronautics and Astronautics, and Mechanical Engineering at MIT, vice president of robotics research at the Toyota Research Institute, and CSAIL member. Their work was supported in part by the U.S. National Science Foundation, the Singapore Defence Science and Technology Agency, the Intelligence Advanced Research Projects Activity via the U.S. Department of the Interior, and the Amazon Science Hub. They will present their research at NeurIPS in December.
