Research 2026-02 Featured

Solaris — Multiplayer Video World Model in Minecraft

First multiplayer video world model generating consistent first-person observations for two players simultaneously, trained on 12.6M frames of coordinated Minecraft gameplay. Published on arXiv, NYU.

12.6M Multiplayer Frames


JAX, Diffusion Transformer, Multiplayer, Minecraft, Computer Vision, World Models

Overview

Solaris is the first multiplayer video world model, generating consistent first-person observations for two players simultaneously in Minecraft. We trained it on 12.6M frames of coordinated gameplay collected through SolarisEngine, a scalable data collection framework we built from scratch. I authored the episode creation logic and action-based gameplay routines — carefully curated to cover the full distribution of multiplayer interactions the model needed to learn, from cooperative building to adversarial combat, movement, and mining. The system uses a staged training pipeline combining bidirectional, causal, and Self Forcing objectives, and outperforms existing baselines across all evaluation categories. Everything is open-sourced: engine, models, datasets, and evaluation benchmarks.

Key Contributions

Figure: Dataset breakdown across 12.6M frames by task category, episode type, and length (9,240 episodes across 30 episode types, 92% QC pass rate).

SolarisEngine: A scalable multiplayer data collection system built on Mineflayer with Docker-based parallel orchestration, producing 9,240 episodes and 12.64M action-annotated frames at 20 FPS — establishing the world’s first open-source multiplayer world-model dataset. I designed and implemented the episode creation logic — highly curated gameplay routines purpose-built to satisfy each training requirement of the model, covering cooperative building, adversarial PvP/PvE combat, coordinated movement, and mining with varied terrain, weather, and time-of-day conditions to maximize visual diversity. Authored 30 episode types with deterministic YAML specs and automated QC (visibility %, distance, FPS) achieving a 92% QC pass rate.
Multiplayer Architecture: Modified Diffusion Transformer blocks with interleaved multiplayer self-attention, per-player 3D rotary position embeddings, and zero-init learned player ID gating. Built on MatrixGame 2.0, the architecture reduces cross-view mismatch by 22% compared with all baselines.
Checkpointed Self Forcing: A memory-efficient Self Forcing variant that decouples autoregressive rollout from backpropagation via cached frame recomputation, cutting training memory from O(Lt·Ls) to O(Lt) and enabling long-horizon video generation at scale. I contributed to the training pipeline supporting this stage by curating targeted evaluation episodes used to diagnose and eliminate visual artifacts during Self Forcing training, helping dial in stable long-horizon generation.
Solaris Eval: A multiplayer evaluation framework testing 5 capabilities — movement, grounding, memory, building, and consistency — using VLM-as-judge scoring across 7 held-out ground-truth episodes, achieving ≥92% ground-truth accuracy. Proposes new metrics beyond PSNR/LPIPS/FID: temporal-LPIPS, Identity Consistency, Cross-View Consistency, Semantic Causality, and Reappearance Latency.
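To make the Checkpointed Self Forcing idea above more concrete, here is a minimal JAX sketch of decoupling the autoregressive rollout from backpropagation. It is an illustration under stated assumptions, not the Solaris implementation: `step_model`, `rollout_no_grad`, `frame_loss`, and `checkpointed_grads` are hypothetical names, the generator is a toy one-layer map, and the MSE objective stands in for whatever loss Self Forcing actually uses. The rollout runs with gradients blocked and caches only the context frame fed to each step; gradients are then recomputed one frame at a time from that cache and accumulated.

```python
import jax
import jax.numpy as jnp


def step_model(params, context, noise):
    # Toy stand-in for the video generator (assumption): previous frame -> next frame.
    return jnp.tanh(context @ params["w"]) + 0.1 * noise


def rollout_no_grad(params, first_frame, noises):
    # Phase 1: autoregressive rollout with gradients blocked.
    # Only the context fed to each step is cached, not intermediate activations.
    def body(prev, noise):
        nxt = jax.lax.stop_gradient(step_model(params, prev, noise))
        return nxt, prev

    _, contexts = jax.lax.scan(body, first_frame, noises)
    return contexts  # shape [T, D]


def frame_loss(params, context, noise, target):
    # Placeholder objective (assumption): MSE against a target frame.
    pred = step_model(params, context, noise)
    return jnp.mean((pred - target) ** 2)


def checkpointed_grads(params, first_frame, noises, targets):
    # Phase 2: recompute each frame from its cached context and
    # accumulate gradients one frame at a time.
    contexts = rollout_no_grad(params, first_frame, noises)
    grad_fn = jax.grad(frame_loss)

    def body(acc, xs):
        ctx, noise, tgt = xs
        g = grad_fn(params, ctx, noise, tgt)
        return jax.tree_util.tree_map(jnp.add, acc, g), None

    zeros = jax.tree_util.tree_map(jnp.zeros_like, params)
    total, _ = jax.lax.scan(body, zeros, (contexts, noises, targets))
    return total
```

In this toy setup the accumulated gradient equals the gradient of the summed per-frame loss with detached contexts, while only one frame's activations need to be live during the backward pass at any time.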

Architecture

The system uses a Diffusion Transformer (DiT) with flow matching, initialized from MatrixGame 2.0 and extended for multiplayer:

Interleaved multiplayer self-attention with tokens from all players concatenated in shared attention blocks
3D rotary position embeddings applied independently per player
Learned player ID embeddings for distinguishing between agents
Staged training: single-player pretraining, then multiplayer fine-tuning with Diffusion Forcing, then Self Forcing for long-horizon stability
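The multiplayer extensions listed above can be sketched in a few lines of JAX. This is a single-head toy under stated assumptions rather than the actual Solaris block: the function names, the parameter layout, the scalar `id_gate`, and the equal three-way channel split for the (time, height, width) rotary axes are all hypothetical choices made for illustration. Each player's tokens receive the same 3D rotary encoding independently, a zero-initialized gate scales the learned player ID embedding, and tokens from all players are concatenated before one shared self-attention.

```python
import jax
import jax.numpy as jnp


def rope_1d(x, pos, base=10000.0):
    # Rotate channel pairs of x [..., N, d] by angles that depend on pos [N].
    d = x.shape[-1]
    freqs = base ** (-jnp.arange(0, d, 2) / d)   # [d/2]
    ang = pos[:, None] * freqs[None, :]          # [N, d/2]
    cos, sin = jnp.cos(ang), jnp.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = jnp.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return out.reshape(x.shape)


def rope_3d(x, pos_thw):
    # Split channels into three groups, one per (t, h, w) axis (assumed layout).
    chunks = jnp.split(x, 3, axis=-1)
    rotated = [rope_1d(c, pos_thw[:, i]) for i, c in enumerate(chunks)]
    return jnp.concatenate(rotated, axis=-1)


def init_params(key, num_players, dim):
    kq, kk, kv, ke = jax.random.split(key, 4)
    scale = dim ** -0.5
    return {
        "wq": scale * jax.random.normal(kq, (dim, dim)),
        "wk": scale * jax.random.normal(kk, (dim, dim)),
        "wv": scale * jax.random.normal(kv, (dim, dim)),
        "player_emb": jax.random.normal(ke, (num_players, dim)),
        "id_gate": jnp.zeros(()),  # zero-init: player IDs have no effect at start
    }


def multiplayer_attention(params, x, pos_thw):
    # x: [P, N, D] tokens per player; pos_thw: [N, 3] positions shared by every player.
    P, N, D = x.shape
    x = x + params["id_gate"] * params["player_emb"][:, None, :]
    q = rope_3d(x @ params["wq"], pos_thw)  # RoPE applied independently per player
    k = rope_3d(x @ params["wk"], pos_thw)
    v = x @ params["wv"]
    # Concatenate all players' tokens so attention spans both views.
    q, k, v = (a.reshape(P * N, D) for a in (q, k, v))
    attn = jax.nn.softmax(q @ k.T / (D ** 0.5), axis=-1)
    return (attn @ v).reshape(P, N, D)
```

With the gate at zero the block treats players symmetrically, so swapping the two input views swaps the outputs; the gate lets fine-tuning introduce player identity gradually without disturbing the pretrained single-player weights at initialization.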

Figure: Left: DiT block modified for multiplayer visual interleaving. Right: peak HBM memory for naive Self Forcing vs. the Checkpointed variant.

Tech Stack

Model: Diffusion Transformer (DiT) with flow matching, built on MatrixGame 2.0
Framework: JAX
Data Engine: SolarisEngine — Mineflayer + Minecraft Java client, Docker Compose orchestration
Training: Google TPU (TRC program), Checkpointed Self Forcing, gradient checkpointing
Evaluation: VLM-based scoring + FID metrics across 5 multiplayer capability benchmarks
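The Model entry above names flow matching as the objective for the DiT. As a rough illustration, the sketch below shows one common formulation in JAX: sample a time `t`, linearly interpolate between Gaussian noise and a data sample, and regress the model output onto the constant velocity between them. The function name `flow_matching_loss`, the `velocity_fn(params, xt, t)` signature, and the convention that `t = 0` is noise and `t = 1` is data are assumptions for this example; the convention Solaris or MatrixGame 2.0 uses may differ.

```python
import jax
import jax.numpy as jnp


def flow_matching_loss(velocity_fn, params, x1, key):
    # x1: batch of data samples, shape [B, ...].
    k_t, k_n = jax.random.split(key)
    # One time value per sample, broadcastable over the remaining axes.
    t = jax.random.uniform(k_t, (x1.shape[0],) + (1,) * (x1.ndim - 1))
    x0 = jax.random.normal(k_n, x1.shape)  # noise endpoint
    xt = (1.0 - t) * x0 + t * x1           # straight-line interpolation
    target = x1 - x0                       # velocity along that line
    pred = velocity_fn(params, xt, t)
    return jnp.mean((pred - target) ** 2)
```

Any model with a matching signature can be plugged in as `velocity_fn`, and `jax.grad(flow_matching_loss, argnums=1)` gives the parameter gradients for a training step.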
