The HappyHorse model — a 40-layer single-stream Transformer with 8-step inference and no CFG — topped the Video Arena in both text-to-video and image-to-video. Generate stunning videos now.
Blind-test Elo scores from Artificial Analysis Video Arena — HappyHorse 1.0 topped both text-to-video and image-to-video leaderboards.
A radical rethink of video generation: a single 40-layer Self-Attention Transformer that jointly models text, video, and audio — with no Cross-Attention and no CFG.
One unified Self-Attention Transformer handles text, video, and audio tokens, with no Cross-Attention overhead (see the sketch after this list).
Consistency distillation compresses hundreds of diffusion steps into just 8, with no Classifier-Free Guidance penalty.
Chinese, English, Japanese, Korean, German, and French — all natively supported without translation wrappers.
Sound and picture are the same token sequence. Lip sync, speech coordination, and ambient audio are baked in.
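The features above all follow from one single-stream design choice: every modality lives in a single token sequence, and each layer runs plain self-attention over that shared sequence, so no cross-attention path between modalities is needed. HappyHorse's code has not been released, so the following PyTorch sketch only illustrates the idea; the class names, dimensions, and the token-splitting convention are assumptions, not the actual architecture.

```python
import torch
import torch.nn as nn

class SingleStreamBlock(nn.Module):
    """One Transformer block: self-attention over the full mixed-modality
    sequence followed by an MLP. No cross-attention anywhere."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class SingleStreamTransformer(nn.Module):
    """Illustrative 40-layer single-stream stack: text, video, and audio
    tokens are concatenated into one sequence and share one attention space."""
    def __init__(self, dim: int = 1024, heads: int = 16, depth: int = 40):
        super().__init__()
        self.blocks = nn.ModuleList(SingleStreamBlock(dim, heads) for _ in range(depth))

    def forward(self, text_tok, video_tok, audio_tok):
        # One token sequence, one attention space: [text | video | audio]
        x = torch.cat([text_tok, video_tok, audio_tok], dim=1)
        for blk in self.blocks:
            x = blk(x)
        # Split the sequence back so video and audio latents can be decoded separately
        t, v = text_tok.shape[1], video_tok.shape[1]
        return x[:, t:t + v], x[:, t + v:]
```

Because audio and video tokens attend to each other directly in every layer, lip sync and ambient sound need no separate alignment module; that is the design claim behind the "same token sequence" point above.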
HappyHorse 1.0 is purpose-built for digital humans, short dramas, and avatar content, not just scenic reels. Its image-to-video Elo of 1392 is its strongest result precisely because animating a real face is the model's core strength.
HappyHorse 1.0 is a text-and-image-to-video generation model that topped the Artificial Analysis Video Arena in April 2026, achieving Elo 1392 in image-to-video and Elo 1333 in text-to-video — beating Seedance 2.0, Kling 3.0, and PixVerse V6.
It uses a 40-layer single-stream Self-Attention Transformer with no Cross-Attention. All modalities — text, video, and audio — share one token sequence and one attention space. Inference requires only 8 denoising steps and no Classifier-Free Guidance (CFG).
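Because the distilled model is trained to jump toward the clean sample directly and is not steered by classifier-free guidance, each of the 8 inference steps needs exactly one network call. The loop below is a minimal, hypothetical few-step sampler in PyTorch that illustrates this structure; the model signature, timestep schedule, and re-noising rule are assumptions for illustration, not HappyHorse's published sampler.

```python
import torch

@torch.no_grad()
def sample_8_steps(model, text_tok, audio_tok, latent_shape, device="cuda"):
    """Few-step sampling sketch: 8 denoising steps, one model call per step,
    and no classifier-free guidance (so no second, unconditional pass).

    `model` is assumed to predict the clean latent x0 from a noisy latent and
    a timestep, as a consistency-distilled student typically does.
    """
    x = torch.randn(latent_shape, device=device)             # start from pure noise
    timesteps = torch.linspace(1.0, 0.0, 9, device=device)   # 8 intervals from t=1 to t=0
    for i in range(8):
        t, t_next = timesteps[i], timesteps[i + 1]
        x0_pred = model(x, t, text_tok, audio_tok)            # single conditional pass
        if t_next > 0:
            # Re-noise the predicted clean latent down to the next (lower) noise level
            noise = torch.randn_like(x)
            x = (1.0 - t_next) * x0_pred + t_next * noise
        else:
            x = x0_pred
    return x
```

Dropping CFG matters for speed as well as quality: a guided sampler runs both a conditional and an unconditional pass at every step, so removing guidance roughly halves the network calls at any given step count.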
HappyHorse 1.0 natively supports six languages: Chinese, English, Japanese, Korean, German, and French — with no translation wrapper required.
Both V1 and V2 were removed within days of topping the Arena. The most likely explanations are either an anonymous A/B test run by the developer team, or a deliberate early-access withdrawal before an official open-source release.
The official site states that the base model, distilled model, super-resolution model, and inference code will be fully open-sourced. As of early April 2026, the GitHub and model hub pages are marked "Coming Soon."