Wan 2.2

Wan2.2 is a state-of-the-art open-source video generative model released by Alibaba's Tongyi Lab (Wan AI Team), designed to democratize high-quality video creation.

Major Upgrades

Wan2.2 utilizes a novel MoE architecture that separates the denoising process into specialized stages, significantly enhancing model capacity and performance without increasing computational cost.

Wan2.2 is trained on a substantially larger dataset (65.6% more images and 83.2% more videos than Wan2.1) with rich aesthetic labels, delivering superior texture, lighting, and color consistency suitable for cinematic production.

The introduction of Wan2.2-S2V-14B allows for high-fidelity video generation driven directly by audio inputs, achieving state-of-the-art performance in lip-sync and motion synchronization.

Model Details

Publisher: Wan AI (Alibaba)
Open Status: Open Source (Apache 2.0)
Model Parameters: 5B (Hybrid), 14B (MoE)
Modalities: T2V, I2V, S2V, V2V
Included Models: Wan2.2-T2V-14B, Wan2.2-I2V-14B, Wan2.2-S2V-14B, Wan2.2-TI2V-5B
Output Aspect Ratios: 16:9, 9:16, 1:1, 4:3
Output Resolutions: 480p, 720p, 1080p
Output Duration: 5s (optimized), up to 10s
Output Frame Rates: 16fps, 24fps

Summary

Wan2.2 stands out as a powerful open-source contender, leveraging a Mixture-of-Experts architecture to deliver high-fidelity, cinematic video generation. Its unique "Last Frame" control and visual text generation capabilities offer creators unprecedented precision, while its efficient 5B variant democratizes access to high-quality video synthesis. Although low-level fidelity metrics such as PSNR and SSIM show room for improvement on specific tasks, its overall visual aesthetic and motion smoothness are top-tier.

Key Features

Last Frame Control: A unique feature allowing users to specify the final frame of the video, enabling precise control over transitions and ending states.

Visual Text Generation: The first video model capable of generating coherent Chinese and English text within the video content.

Video Showcases

Tags: animal, unusual activity

Dogs are the players at The World Series Of Poker and they are drinking big bowls of water very sloppily and splashing water on the cards and on the felt of the poker table, one dog poker player is tilting their head sideways in confusion.

Tags: camera motion, human activity

A low-angle shot of a dancer leaping gracefully into the air, making their movement appear even more dynamic and powerful.

Tags: unusual subject, high motion level

A giant humanoid, made of fluffy blue cotton candy, stomping on the ground, and roaring to the sky, clear blue sky behind them.

Tags: scene, camera motion

A drone camera circles around a beautiful historic church built on a rocky outcropping along the Amalfi Coast, the view showcases historic and magnificent architectural details and tiered pathways and patios, waves are seen crashing against the rocks below as the view overlooks the horizon of the coastal waters and hilly landscapes of the Amalfi Coast Italy, several distant people are seen walking and enjoying vistas on patios of the dramatic ocean views, the warm glow of the afternoon sun creates a magical and romantic feeling to the scene, the view is stunning captured with beautiful photography.

Performance Metrics

Wan 2.2 Model Capability Assessment (Dec 20, 2025)

Radar chart: Wan 2.2 scored on a 0-100 scale across six dimensions: Subject Consistency, Temporal Consistency, Aesthetic & Image Quality, Dynamics & Motion Fidelity, Visual Quality, and Semantic Alignment.

Wan 2.2 Metrics Bar Charts by Dimension

Visual Quality

PSNR: 45.0
SSIM: 47.3
LPIPS: 36.0
FVD: 18.4
Inception Score (IS): 36.0

Scores normalized to 0-100.

Temporal Consistency

Temporal Warping Error: 42.0
Optical Flow Consistency: 54.0
Temporal Flicker Score: 48.0
Long-term Consistency Tracking: 43.2
Motion Smoothness: 48.0

Scores normalized to 0-100.

Semantic Alignment

CLIP Score: 42.0
Tag2Text / UMT / GRiT: 36.0
Semantic Accuracy: 45.0

Scores normalized to 0-100.

Subject Consistency

DINO Feature Similarity: 40.5
Object Identity Tracking: 42.0
Multiple Object Consistency: 49.5

Scores normalized to 0-100.

Aesthetic & Image Quality

LAION Aesthetic Predictor: 36.0
MUSIQ Score: 54.0
Color/Texture Consistency: 66.0
Human-Opinion MOS: 54.0

Scores normalized to 0-100.

Dynamics & Motion

Action Recognition Accuracy: 36.0
Dynamics Controllability: 48.0
Motion Diversity Score: 54.0
Physical Realism Score: 32.4

Scores normalized to 0-100.

Service Providers

Hugging Face

Hosts the official model weights and inference spaces for Wan2.2, allowing users to try the model directly in the browser.

API Providers

Fal.ai

Offers optimized API endpoints for Wan2.2, including the 5B and 14B variants, suitable for enterprise-grade integration.

Replicate

Provides scalable API access to Wan2.2 models, including speed-optimized versions for rapid generation.

People Also Ask

What is Wan 2.2 AI?

Wan 2.2 AI is an open-source, large-scale video generative model that uses a Mixture-of-Experts diffusion architecture to produce high-quality 720p videos from text, images, speech, or combinations of these inputs. It supports text-to-video, image-to-video, text-image-to-video, speech-to-video, and character animation modes, and is designed to run both in research/production backends and on high-end consumer GPUs such as the RTX 4090.

Is Wan 2.2 censored?

The core Wan 2.2 model weights are released under Apache 2.0 and do not include built-in content filters, so technically the model can be prompted to generate NSFW content when run locally or in third-party tools that do not add extra safety layers. However, the official Wan services and most commercial platforms require users to comply with their usage policies and applicable laws, which typically prohibit illegal or harmful content even if the model itself is not hard-censored.

How do I install Wan 2.2 locally?

To install Wan 2.2 locally, you generally clone the official GitHub repository, install the Python dependencies, and then download one or more model checkpoints (e.g., T2V-A14B, I2V-A14B, TI2V-5B, S2V-14B, Animate-14B) from Hugging Face or ModelScope. A typical setup involves a git clone of the Wan2.2 repo, a pip install -r requirements.txt (plus optional extras such as the speech-to-video requirements), and then using the provided scripts or the Diffusers/ComfyUI integrations to load the downloaded weights.
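The steps above can be sketched as a shell session. This is a minimal sketch, not an official install script: the GitHub organization (Wan-Video), the Hugging Face organization (Wan-AI), and the exact checkpoint name are assumptions based on the project's public naming conventions; verify them against the repository before running.

```shell
# Clone the official Wan2.2 repository and install its Python dependencies.
git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2
pip install -r requirements.txt

# Download a checkpoint from Hugging Face, e.g. the 5B text-image-to-video
# model (smallest variant, suited to a single high-end consumer GPU).
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B
```

The larger 14B MoE checkpoints follow the same download pattern but need considerably more disk space and VRAM.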

How do I use Wan 2.2?

Wan 2.2 can be used through several interfaces: the official wan.video website, native Python/CLI scripts (generate.py for the different tasks), and integrations with frameworks like Diffusers and ComfyUI. In practice, you choose a task (such as text-to-video or image-to-video), specify the resolution and checkpoint, provide a prompt and optional reference media, and then run generation either locally on your GPU or via a supported online interface.
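As a concrete sketch of the native CLI path, a text-to-video run with the 5B checkpoint might look like the following. The task name, size format, and flag names here follow the repository's README conventions as an assumption and may differ between versions; run python generate.py --help in your checkout to confirm.

```shell
# Generate a short clip from a text prompt with the TI2V-5B checkpoint.
# --task selects the model/mode, --size the output resolution,
# --ckpt_dir the directory holding the downloaded weights.
python generate.py \
  --task ti2v-5B \
  --size "1280*704" \
  --ckpt_dir ./Wan2.2-TI2V-5B \
  --prompt "A low-angle shot of a dancer leaping gracefully into the air."
```

For image-to-video or speech-to-video, the same script is invoked with a different --task plus a reference image or audio file.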

Is Wan 2.2 free to use?

Yes, the Wan 2.2 model weights and code are released as open source under the Apache 2.0 license, so you can download, modify, and use them (including commercially) without paying licensing fees to the authors, subject to the license terms. Some hosted services and cloud providers that expose Wan 2.2 (for example, web UIs or GPU rental platforms) may charge for compute, storage, or premium features, even though the underlying model itself is free.

Is Wan 2.2 safe and compliant to use?

From a compliance perspective, the official project states that users are responsible for ensuring their generated content does not violate laws or cause harm, and the models are distributed under Apache 2.0 with explicit responsibility and usage restrictions in the license and usage policy. From a security and privacy perspective, running Wan 2.2 locally keeps your data on your own hardware, while third-party NSFW or general-purpose Wan 2.2 services typically emphasize private storage of outputs but have varying safety, moderation, and logging practices that you should review individually.
