Video generation has entered the "all-round" era.

On January 29th, Skywork AI officially open-sourced its self-developed video generation model, SkyReels-V3. This multimodal video generation series supports three core capabilities: Reference Images-to-Video, Video Extension, and Talking Avatar. It achieves high-fidelity multimodal video generation within a single modeling architecture, reaching the industry's leading level.

The three core capabilities work as independent modules, each deeply optimized, and can be combined flexibly. The Skywork AI team achieved this through enterprise-level data processing, ultra-fast inference, and an efficient training architecture, enabling the generated videos to reach professional-grade quality, with multiple metrics matching or exceeding industry-leading levels.

SkyReels-V3 generates high-quality video sequences with consistent timing and semantics from 1 to 4 reference images combined with a text prompt. Whether the input is a character portrait, a product shot, or a background scene, the generated video accurately preserves the original identity features, spatial composition, and narrative coherence.
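
For context, here is a minimal sketch of what invoking such a multi-reference pipeline could look like. The module `skyreels_v3`, the class `SkyReelsV3Pipeline`, the checkpoint id, and every argument below are hypothetical placeholders for illustration, not the published SkyReels-V3 API.

```python
# Hypothetical usage sketch: module, class, and argument names are
# assumptions for illustration, not the published SkyReels-V3 API.
from PIL import Image
from skyreels_v3 import SkyReelsV3Pipeline  # hypothetical wrapper module

pipe = SkyReelsV3Pipeline.from_pretrained("Skywork/SkyReels-V3")  # hypothetical id

# Between 1 and 4 reference images (character, product, background, ...).
refs = [Image.open(p) for p in ["anchor.png", "product.png"]]

video = pipe(
    reference_images=refs,
    prompt="A host presents the product on a bright studio set",
    num_frames=97,           # assumed clip length
    height=480, width=832,   # assumed output resolution
)
video.save("promo_clip.mp4")
```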


Behind this capability are several innovations by the Skywork AI team in data construction, multi-reference condition fusion, and hybrid training strategy:

1. High-quality data construction: The team screened clips with significant motion from a vast pool of videos and adopted a cross-frame pairing strategy to ensure temporal diversity (a pairing sketch follows this list). More importantly, they used an image-editing model to extract the subject region, inpaint the background, and rewrite the semantics, effectively avoiding the common "copy-paste" artifact and securing generation quality at the data source.

2. Multi-reference condition fusion: The model jointly encodes visual and text information under a unified strategy, supporting up to 4 reference images (a fusion sketch follows this list). Users can thus compose complex multi-subject, multi-element scenes with natural interaction, without image stitching or manual masking. In e-commerce, for example, product images can be combined with a virtual anchor's image to directly generate a promotional video in a specific setting while accurately preserving product details and the anchor's identity.

3. Hybrid training strategy: The team trained jointly on large-scale image and video datasets and applied multi-resolution joint optimization to improve robustness across spatial scales and aspect ratios (a batching sketch follows this list).
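
To make the cross-frame pairing idea from point 1 concrete, here is a minimal sketch of how a reference frame and a training clip might be drawn from the same video with a guaranteed temporal gap. The function and its parameter values are illustrative assumptions; the team's actual pairing logic is not published in this article.

```python
import random

def sample_cross_frame_pair(frames, min_gap=48, clip_len=16):
    """Pick a reference frame and a target clip from the same video,
    separated by at least `min_gap` frames, so the pair is temporally
    diverse and the model cannot simply copy-paste the reference.
    Sketch only; parameter values are illustrative assumptions."""
    if len(frames) < min_gap + clip_len + 1:
        raise ValueError("video too short for the requested gap")
    ref_idx = random.randrange(0, len(frames) - min_gap - clip_len)
    clip_start = ref_idx + min_gap
    return frames[ref_idx], frames[clip_start : clip_start + clip_len]
```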
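
Point 2 describes jointly encoding the reference images and the text into one conditioning stream. A common way to realize this is to concatenate the token sequences along the sequence dimension so the generator attends to all conditions at once; the sketch below shows that general idea only, not the model's actual fusion module.

```python
import torch

def fuse_conditions(image_tokens_list, text_tokens):
    """Concatenate up to 4 reference-image token sequences with the text
    tokens into one conditioning sequence the generator can attend to.
    Illustrative sketch, not SkyReels-V3's actual fusion code."""
    assert 1 <= len(image_tokens_list) <= 4
    return torch.cat(image_tokens_list + [text_tokens], dim=1)

# Example shapes: two (1, 256, 1024) image sequences plus one (1, 77, 1024)
# text sequence fuse into a single (1, 589, 1024) conditioning sequence.
imgs = [torch.randn(1, 256, 1024) for _ in range(2)]
txt = torch.randn(1, 77, 1024)
print(fuse_conditions(imgs, txt).shape)  # torch.Size([1, 589, 1024])
```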
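
Point 3 mixes image and video data at multiple resolutions. One standard trick, sketched below under assumed bucket sizes, is to treat a still image as a one-frame clip and resize every sample to a shared resolution bucket so both data types flow through the same training loop.

```python
import torch
import torch.nn.functional as F

def to_clip(x, h, w):
    """Normalize a sample to a (C, T, H, W) clip at one bucket resolution.
    Still images become single-frame clips so image and video data can be
    optimized jointly. Bucket sizes here are illustrative assumptions."""
    if x.dim() == 3:                 # (C, H, W) image -> 1-frame clip
        x = x.unsqueeze(1)
    frames = x.permute(1, 0, 2, 3)   # (T, C, H, W): resize frame-wise
    frames = F.interpolate(frames, size=(h, w), mode="bilinear",
                           align_corners=False)
    return frames.permute(1, 0, 2, 3)

img = to_clip(torch.randn(3, 512, 512), 480, 832)      # -> (3, 1, 480, 832)
vid = to_clip(torch.randn(3, 16, 360, 640), 480, 832)  # -> (3, 16, 480, 832)
```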

In an evaluation on a mixed test set of 200 pairs covering film, TV, e-commerce, advertising, and other domains, SkyReels-V3 performed strongly.

Across reference types including people, animals, objects, and background scenes, SkyReels-V3 achieved a reference-consistency score of 0.6698, surpassing mainstream commercial models such as Vidu Q2 (0.5961), Kling 1.6 (0.6630), and PixVerse V5 (0.6542). It also led with a visual-quality score of 0.8119, demonstrating its ability to generate high-fidelity video while preserving reference features.
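
The article does not define how the reference-consistency score is computed. Purely as an illustration, one common recipe is the mean cosine similarity between the reference image's embedding and each generated frame's embedding, e.g. in CLIP space; the benchmark's actual metric may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def reference_consistency(ref_image, frames):
    """Mean cosine similarity between the reference image and generated
    frames in CLIP embedding space. A common proxy for reference
    consistency; not necessarily the metric used in this evaluation."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = proc(images=[ref_image] + list(frames), return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    return (emb[1:] @ emb[0]).mean().item()     # average over frames
```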