Wan 2.1 & WanX 2.1 & Alibaba Wan AI

What is Wan 2.1 by Alibaba Wan AI?

Wan AI is an advanced and powerful visual generation model developed by Tongyi Lab of Alibaba Group. It can generate videos based on text, images, and other control signals. The Wan 2.1 series models are now fully open-source.

Overview of Wan AI

SOTA Performance

Wan 2.1 consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.

Supports Consumer-grade GPUs

The T2V-1.3B model requires only 8.19 GB VRAM, making it compatible with almost all consumer-grade GPUs. It can generate a 5-second 480P video on an RTX 4090 in about 4 minutes (without optimization techniques like quantization). Its performance is even comparable to some closed-source models.
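The figures above can be turned into a quick hardware check. The following is a minimal sketch, not part of the Wan 2.1 code base: the helper names `fits_in_vram` and `estimate_wall_seconds` are hypothetical, and the linear scaling of generation time with clip length is an assumption made only for illustration. Real timings depend on resolution, sampling settings, and hardware.

```python
# Figures quoted above for the T2V-1.3B model.
REQUIRED_VRAM_GB = 8.19
REFERENCE_CLIP_SECONDS = 5.0       # 5-second 480P clip
REFERENCE_WALL_SECONDS = 4 * 60.0  # about 4 minutes on an RTX 4090


def fits_in_vram(total_vram_gb, required_gb=REQUIRED_VRAM_GB, headroom_gb=0.0):
    """Return True if the card's memory covers the requirement plus headroom.

    Hypothetical helper: headroom_gb models memory already used by the
    desktop or other processes.
    """
    if total_vram_gb <= 0 or required_gb <= 0 or headroom_gb < 0:
        raise ValueError("memory sizes must be positive and headroom non-negative")
    return total_vram_gb >= required_gb + headroom_gb


def estimate_wall_seconds(clip_seconds):
    """Scale the quoted RTX 4090 timing linearly with clip length (assumption)."""
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    seconds_per_video_second = REFERENCE_WALL_SECONDS / REFERENCE_CLIP_SECONDS
    return clip_seconds * seconds_per_video_second


if __name__ == "__main__":
    for label, vram_gb in [("24 GB card", 24.0), ("12 GB card", 12.0), ("8 GB card", 8.0)]:
        print(f"{label}: fits = {fits_in_vram(vram_gb)}")
    print(f"Estimated time for a 5 s clip: {estimate_wall_seconds(5.0):.0f} s")
```

Note that a card with exactly 8 GB falls just short of the quoted 8.19 GB, which is consistent with the "almost all" wording above.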

Multiple tasks

Wan 2.1 excels in Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio, advancing the field of video generation.

Visual Text Generation

Wan 2.1 is the first video model capable of generating both Chinese and English text, featuring robust text generation that enhances its practical applications.

Powerful Video VAE of Wan AI

Wan-VAE delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.

Features of Wan AI

Complex Motions by Wan AI

Excels at generating realistic videos featuring extensive body movements, complex rotations, dynamic scene transitions, and fluid camera motions.

Physical Simulation by Wan AI

Generates videos that accurately simulate real-world physics and realistic object interactions.

Cinematic Quality by Wan AI

Offers movie-like visuals with rich textures and a variety of stylized effects.

Controllable Editing by Wan AI

Features a universal editing model for precise edits using image or video references.

Visual Text Generation by Wan AI

Creates text and dynamic text effects in videos directly from text prompts.

8-Bit Racing

Prompt: A retro 8-bit style animation of a car race intro. Pixelated muscle cars, each with distinct colors and designs, line up at a starting line in a vast, pixelated desert landscape. Large, pixelated text "WANX RACING" flashes above the cars in vibrant neon colors, reminiscent of classic arcade game titles. The camera pans across the scene, highlighting the retro aesthetic and text. The background features a simple, pixelated desert landscape with a blocky sunset casting warm, golden hues over the scene. The entire environment is bathed in vibrant, pixelated neon colors, enhancing the nostalgic feel.

Merry Christmas

Prompt: Realist, beautifully decorated Christmas party scene, Christmas trees adorned with colorful lights and gifts, flames dancing in the fireplace, gingerbread people wearing Christmas hats dancing around the tree, and tables filled with grilled turkey and other delicacies. Exquisite text effects pop up on the screen: "Merry Christmas!" The screen is exquisite, sophisticated, and concise.

Mad Ricing

Prompt: A retro 70s-style title sequence for a fictional action movie. Hand-drawn, stylized text "WANX" appears dynamically on screen, overlaid on fast-paced clips of car chases, explosions, and daring stunts. The text is bold, gritty, and slightly distorted, reflecting the 70s action movie aesthetic. A montage of high-octane scenes with a retro film grain effect, featuring warm, vintage colors. The sequences are bathed in golden hour light, enhancing the nostalgic feel.

Sound Effects & Music by Wan AI

Generates sound effects and background music that perfectly align with visual content and rhythm.

Ferrets Entering Water

Prompt: The camera moves rapidly from far to near, with a low angle of view, standing on a log. In the distant view, a white ferret suddenly appears, playing with the log and jumping into the water, then swimming out of the water and sticking its head out. At this moment, the camera zooms in to show a close-up of the white ferret. Several berry trees next to it are splashed with water, moss and snow cover the ground, and the water surface is covered by green fallen leaves. The background is white birch.

Concert of Wan AI

Prompt: A group of people is performing a symphony in the Vienna Hall.

Ice Falling

Prompt: A group of people is performing a symphony in the Vienna Hall.

Product Features

Through our product, you can seamlessly leverage our models with a user-friendly experience to access inspiring video content.

Wan AI Open Source

In this repo, we release the code and weights for Wan 2.1, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation.

The I2V-14B model outperforms leading closed-source models as well as all existing open-source models, achieving SOTA performance. It generates videos with complex visual scenes and motion patterns from input text and images, and is available in both 480P and 720P variants.

Wan2.1-T2V-14B

480P / 720P

The T2V-14B model sets a new SOTA among both open-source and closed-source models, showcasing its ability to generate high-quality visuals with substantial motion dynamics. It is also the only video model capable of producing both Chinese and English text, and it supports video generation at both 480P and 720P resolutions.

Wan2.1-T2V-1.3B

480P

The T2V-1.3B model supports video generation on almost all consumer-grade GPUs, requiring only 8.19 GB of VRAM to produce a 5-second 480P video, with a generation time of about 4 minutes on an RTX 4090 GPU. Through pre-training and distillation, it surpasses larger open-source models and achieves performance comparable even to some advanced closed-source models.
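The model lineup described above can be summarized as a small lookup table. This is an illustrative sketch only: the `ModelInfo` class, the `CATALOG` tuple, and the `candidates` helper are hypothetical names, and the entries simply restate the tasks and resolutions listed on this page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ModelInfo:
    name: str
    task: str                     # "t2v" (text-to-video) or "i2v" (image-to-video)
    resolutions: Tuple[str, ...]  # resolutions listed on this page


# Entries restate the descriptions above; nothing here is read from the repo.
CATALOG = (
    ModelInfo("Wan2.1-T2V-14B", "t2v", ("480P", "720P")),
    ModelInfo("Wan2.1-T2V-1.3B", "t2v", ("480P",)),
    ModelInfo("I2V-14B", "i2v", ("480P", "720P")),
)


def candidates(task: str, resolution: str) -> List[str]:
    """Return the names of models that cover the given task and resolution."""
    return [m.name for m in CATALOG if m.task == task and resolution in m.resolutions]


if __name__ == "__main__":
    print(candidates("t2v", "480P"))
    print(candidates("t2v", "720P"))
    print(candidates("i2v", "720P"))
```

A lookup like this makes the trade-off explicit: 720P text-to-video requires the 14B model, while 480P can also run on the lighter 1.3B model.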

Tech Report

Stay tuned for the upcoming release of our comprehensive technical report for more details.

Built upon the mainstream diffusion transformer paradigm, Wan 2.1 achieves significant advancements in generative capabilities through a series of innovations, including our novel spatio-temporal variational autoencoder (VAE), scalable pre-training strategies, large-scale data construction, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility.
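To give a feel for why a spatio-temporal VAE matters, the sketch below computes the size of the latent grid that a diffusion transformer would operate on. It rests on stated assumptions rather than on figures from this page: the strides (4 in time, 8 in space), the causal layout in which the first frame is kept and every further group of `t_stride` frames becomes one latent frame, and the example clip size are all illustrative, and `latent_shape` is a hypothetical helper.

```python
def latent_shape(frames, height, width, t_stride=4, s_stride=8):
    """Return (latent_frames, latent_height, latent_width) for a video clip.

    Assumed layout (not stated on this page): the first frame maps to one
    latent frame, and each following group of t_stride frames maps to one
    more; height and width are divided by s_stride.
    """
    if frames < 1 or height < 1 or width < 1:
        raise ValueError("frames, height and width must be positive")
    if (frames - 1) % t_stride != 0:
        raise ValueError("frames must equal 1 + k * t_stride for an integer k")
    if height % s_stride != 0 or width % s_stride != 0:
        raise ValueError("height and width must be multiples of s_stride")
    return (1 + (frames - 1) // t_stride, height // s_stride, width // s_stride)


if __name__ == "__main__":
    # Example clip size chosen for illustration only.
    t, h, w = latent_shape(frames=81, height=480, width=832)
    print(f"latent grid: {t} x {h} x {w}")
```

Under these assumptions the transformer sees far fewer positions than there are pixels, which is the main reason latent-space generation is cheaper than working on raw frames.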

Frequently Asked Questions

1. What is Wan 2.1 by Wan AI and how does it work?

Wan 2.1 by Wan AI is Alibaba Cloud's state-of-the-art video generation model that transforms text descriptions into stunning, high-quality videos. Leveraging advanced technologies like Variational Autoencoders (VAE) and Diffusion Transformers (DiT), it ensures realistic visuals, smooth transitions, and accurate physics for a truly immersive experience.

2. Do I need technical expertise to use Wan 2.1 by Wan AI?

Wan 2.1 by Wan AI is designed with simplicity in mind. Its intuitive interface allows anyone to create professional-quality videos effortlessly, even without advanced technical skills. Whether you're a beginner or a pro, you'll find the platform easy to navigate and use.

3. What types of videos can I create with Wan 2.1 by Wan AI?

Wan 2.1 by Wan AI is versatile and capable of generating a wide range of video content. From dynamic scenes like dancing and sports to educational tutorials and historical video restoration, it empowers you to bring your creative vision to life.

4. How long does it take to generate a video?

The video generation time depends on the complexity and length of your project. For faster results, the Pro version offers accelerated processing speeds, making it ideal for time-sensitive tasks.

5. Can I customize the video output?

Absolutely! Wan 2.1 by Wan AI provides extensive customization options, allowing you to adjust resolution, frame rate, movement complexity, and more. Tailor your videos to meet your specific needs and preferences.

6. What input formats does Wan 2.1 by Wan AI support for video generation?

Wan 2.1 by Wan AI supports both text descriptions and images as input for video generation. For text-to-video, you provide a detailed textual prompt describing the scene, actions, and desired visual effects. For image-to-video, the I2V-14B model generates video from an input image together with text.

7. Can Wan 2.1 by Wan AI generate videos in multiple languages?

Yes, Wan 2.1 by Wan AI supports multilingual text inputs, allowing you to generate videos based on descriptions in various languages. However, the quality of output may vary depending on the language and the complexity of the description.

8. Is there a limit to the length of videos that Wan 2.1 by Wan AI can generate?

The length of generated videos depends on the subscription plan. The free version may have limitations on video duration, while the Pro version supports longer and more complex video generation. Specific limits can be found in the platform's documentation.

9. How does Wan 2.1 by Wan AI ensure the quality of generated videos?

Wan 2.1 by Wan AI leverages advanced technologies like Variational Autoencoders (VAE) and Diffusion Transformers (DiT) to ensure high-quality outputs. These technologies enable realistic visuals, smooth transitions, and accurate physics simulations.

10. How does Wan 2.1 by Wan AI handle complex scenes with multiple characters?

Wan 2.1 by Wan AI is designed to handle complex scenes with multiple characters by analyzing the relationships and interactions described in the text input. It uses advanced algorithms to ensure realistic positioning, movements, and interactions between characters.