Research 2026-02 Flagship Research

Solaris — Multiplayer Video World Model in Minecraft

Solaris is the first multiplayer video world model, generating consistent first-person observations for two players simultaneously. It is a NeurIPS 2026 submission (arXiv:2602.22208, NYU). The SolarisEngine data collection system captured 12.64M synchronized multiplayer frames. Checkpointed Self Forcing enables long-horizon training at reduced memory cost, with a 36% FID improvement. Evaluation uses a multi-agent framework built on VLM-as-Judge methodology.

Synchronized frames: 12.64M
Tags: JAX, Diffusion Transformer, Multiplayer, Minecraft, Computer Vision, World Models, NeurIPS

What this project proves

Research engineering on data and evaluation systems

Solaris is a multiplayer video world model trained on 12.64M synchronized frames, with a multi-agent evaluation framework and an arXiv paper listing Dhairya as an author.

Core challenge

Model consistent multi-agent first-person video over long horizons in a dynamic environment.

Evaluation lens

Data collection, world-model training setup, and evaluation-system design.

Publication-grade system with open artifacts, evaluation methodology, and a project website linked from arXiv.

Solaris generating synchronized first-person observations for two players simultaneously in Minecraft.

Overview

Solaris is the first multiplayer video world model, generating consistent first-person observations for two players simultaneously in Minecraft. A NeurIPS 2026 submission, it was trained on 12.64M synchronized frames of coordinated gameplay collected through SolarisEngine, a purpose-built multiplayer data collection system. I authored the episode creation logic and action-based gameplay routines, curated to cover the multiplayer interactions the model needed to learn: cooperative building, adversarial combat, movement, and mining. The system uses a staged training pipeline combining bidirectional, causal, and Self Forcing objectives, and outperforms existing baselines across all evaluation categories. The engine, models, datasets, and evaluation benchmarks are all open-sourced. Paper: arXiv:2602.22208.
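
As a rough sanity check on scale, the headline figures can be turned into durations. This is a back-of-the-envelope sketch that combines the 12.64M frame count with the 9,240 episodes and 20 FPS reported under Key Contributions. It assumes the frame count describes a single stream; if the count sums both players' views, the wall-clock figures would be half of what is computed here. The function name `dataset_summary` is made up for illustration.

```python
# Figures quoted on this page.
TOTAL_FRAMES = 12_640_000  # 12.64M synchronized frames
EPISODES = 9_240
FPS = 20


def dataset_summary(total_frames: int, episodes: int, fps: int) -> dict:
    """Derive rough duration figures from the headline dataset numbers."""
    total_seconds = total_frames / fps
    frames_per_episode = total_frames / episodes
    return {
        "total_hours": total_seconds / 3600,
        "mean_frames_per_episode": frames_per_episode,
        "mean_seconds_per_episode": frames_per_episode / fps,
    }


summary = dataset_summary(TOTAL_FRAMES, EPISODES, FPS)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

Under that assumption the dataset works out to roughly 175 hours of footage, with an average episode a little over a minute long.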

Key Contributions

Dataset breakdown — 9,240 episodes across 30 episode types, 92% QC pass rate
  • SolarisEngine: A scalable multiplayer data collection system built on Mineflayer with Docker-based parallel orchestration. It produced 9,240 episodes and 12.64M action-annotated frames at 20 FPS, establishing the first open-source multiplayer world-model dataset. I designed and implemented the episode creation logic: curated gameplay routines built to satisfy each training requirement of the model, covering cooperative building, adversarial PvP/PvE combat, coordinated movement, and mining, with varied terrain, weather, and time-of-day conditions to maximize visual diversity. I authored 30 episode types with deterministic YAML specs and automated QC (visibility %, distance, FPS), achieving a 92% QC pass rate.
  • Multiplayer Architecture: Modified Diffusion Transformer blocks with interleaved multiplayer self-attention, per-player 3D rotary position embeddings, and zero-initialized learned player ID gating. Built on MatrixGame 2.0. Reduced cross-view mismatch by 22% relative to all baselines.
  • Checkpointed Self Forcing: A memory-efficient Self Forcing variant for long-horizon training. It decouples the autoregressive rollout from backpropagation via cached frame recomputation, cutting training memory from O(L_t · L_s) to O(L_t), which enables long-horizon video generation at scale. It achieves a 36% FID improvement over baselines. I contributed to the training pipeline for this stage by curating targeted evaluation episodes used to diagnose and eliminate visual artifacts during Self Forcing training, which helped stabilize long-horizon generation.
  • Solaris Eval: A multi-agent evaluation framework using VLM-as-Judge methodology. It tests 5 capabilities (movement consistency, spatial memory, grounding, building, and cross-view coherence), scored by a VLM across 7 held-out ground-truth episodes, achieving ≥92% ground-truth accuracy. It also proposes new metrics beyond PSNR/LPIPS/FID: temporal-LPIPS, Identity Consistency, Cross-View Consistency, Semantic Causality, and Reappearance Latency.

Architecture

The system uses a Diffusion Transformer (DiT) with flow matching, initialized from MatrixGame 2.0 and extended for multiplayer:

  1. Interleaved multiplayer self-attention with tokens from all players concatenated in shared attention blocks
  2. 3D rotary position embeddings applied independently per player
  3. Learned player ID embeddings for distinguishing between agents
  4. Staged training: single-player pretraining, then multiplayer fine-tuning with Diffusion Forcing, then Self Forcing for long-horizon stability
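
Items 1 to 3 above can be illustrated with a small token-layout sketch. It only shows the bookkeeping, not attention itself, and it assumes that the players' tokens are concatenated one player after another; the actual ordering used by the model is not stated here. The names `Token`, `player_tokens`, and `joint_sequence` are hypothetical.

```python
from itertools import product
from typing import NamedTuple


class Token(NamedTuple):
    player: int  # index into a learned player-ID embedding table
    t: int       # frame index, counted separately for each player
    h: int       # patch row
    w: int       # patch column


def player_tokens(player: int, frames: int, height: int, width: int) -> list[Token]:
    """Tokens for one player's video, with positions starting at (0, 0, 0)."""
    return [
        Token(player, t, h, w)
        for t, h, w in product(range(frames), range(height), range(width))
    ]


def joint_sequence(num_players: int, frames: int, height: int, width: int) -> list[Token]:
    """Concatenate every player's tokens into one shared attention sequence."""
    sequence: list[Token] = []
    for player in range(num_players):
        sequence.extend(player_tokens(player, frames, height, width))
    return sequence


seq = joint_sequence(num_players=2, frames=2, height=2, width=2)
first, second = seq[: len(seq) // 2], seq[len(seq) // 2 :]

# Positions restart for each player, so the two halves carry identical
# (t, h, w) coordinates and differ only in the player field.
same_positions = [a[1:] for a in first] == [b[1:] for b in second]
print(len(seq), same_positions)  # 16 True
```

Because per-player positions coincide, position embeddings alone cannot tell the two views apart, which is one way to read the need for the player ID embeddings in item 3.
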
SolarisEngine — end-to-end multiplayer data, training, and evaluation framework
Left: DiT block modified for multiplayer visual interleaving. Right: peak HBM memory for naive Self Forcing vs the Checkpointed variant.
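
The memory comparison in the caption rests on trading recomputation for stored activations. The sketch below is a toy illustration of generic gradient checkpointing on a chain of scalar layers. It is not Checkpointed Self Forcing itself, and every name in it (`layer`, `grad_store_all`, `grad_checkpointed`) is invented for the example. It shows that recomputing each segment during the backward pass gives the same gradient while holding fewer values at once.

```python
import math


def layer(x: float, w: float) -> float:
    return math.tanh(w * x)


def layer_grad(x: float, w: float) -> float:
    """Derivative of layer(x, w) with respect to x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_store_all(x0: float, weights: list[float]) -> tuple[float, int]:
    """Backprop that stores every layer input. Returns (dy/dx0, values stored)."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = layer(x, w)
    grad = 1.0
    for x_in, w in zip(reversed(inputs), reversed(weights)):
        grad *= layer_grad(x_in, w)
    return grad, len(inputs)


def grad_checkpointed(x0: float, weights: list[float], segment: int) -> tuple[float, int]:
    """Store only each segment's first input; recompute the rest when needed."""
    checkpoints = []
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(x)
        x = layer(x, w)

    grad, peak = 1.0, 0
    for seg in reversed(range(len(checkpoints))):
        seg_weights = weights[seg * segment : (seg + 1) * segment]
        # Recompute this segment's inputs from its checkpoint.
        inputs = []
        x = checkpoints[seg]
        for w in seg_weights:
            inputs.append(x)
            x = layer(x, w)
        peak = max(peak, len(checkpoints) + len(inputs))
        for x_in, w in zip(reversed(inputs), reversed(seg_weights)):
            grad *= layer_grad(x_in, w)
    return grad, peak


weights = [0.9 + 0.01 * i for i in range(16)]
g_full, peak_full = grad_store_all(0.5, weights)
g_ckpt, peak_ckpt = grad_checkpointed(0.5, weights, segment=4)
print(math.isclose(g_full, g_ckpt), peak_full, peak_ckpt)  # True 16 8
```

With 16 layers and segments of 4, the checkpointed version holds at most 8 values instead of 16, at the cost of one extra forward pass per segment.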

Tech Stack

  • Model: Diffusion Transformer (DiT) with flow matching, built on MatrixGame 2.0
  • Framework: JAX
  • Data Engine: SolarisEngine — Mineflayer + Minecraft Java client, Docker Compose orchestration
  • Training: Google TPU (TRC program), Checkpointed Self Forcing, gradient checkpointing
  • Evaluation: VLM-based scoring + FID metrics across 5 multiplayer capability benchmarks
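
For the VLM-based scoring, one simple way to validate a judge is to compare its verdicts with known answers on ground-truth episodes and report accuracy per capability. The sketch below assumes that setup. The record format, the function name `per_capability_accuracy`, and the sample verdicts are all invented for illustration and are not taken from Solaris Eval.

```python
from collections import defaultdict


def per_capability_accuracy(records):
    """records: iterable of (capability, judge_verdict, expected_verdict)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for capability, verdict, expected in records:
        total[capability] += 1
        correct[capability] += int(verdict == expected)
    return {cap: correct[cap] / total[cap] for cap in total}


# Invented verdicts: the judge answers a question about each clip, and the
# expected answer is known because the clip is real gameplay.
records = [
    ("building", "yes", "yes"),
    ("building", "no", "yes"),
    ("movement", "left", "left"),
    ("movement", "right", "right"),
]
scores = per_capability_accuracy(records)
overall = sum(v == e for _, v, e in records) / len(records)
print(scores, overall)
```

Reporting accuracy per capability rather than only overall makes it easier to see whether the judge is unreliable on one kind of question before trusting its scores on generated video.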