Language-Guided Transformer Tokenizer for Human Motion Generation

1Transsion Ltd. 2Chongqing University of Technology
3State University of New York at Buffalo 4Peking University

In this paper, we focus on motion discrete tokenization, which converts raw motion into compact discrete tokens, a process proven crucial for efficient motion generation. In this paradigm, increasing the number of tokens is a common way to improve motion reconstruction quality, but more tokens also make the generative model harder to train. To maintain high reconstruction quality while reducing generation complexity, we propose leveraging language to achieve efficient motion tokenization, which we term Language-Guided Tokenization (LG-Tok). LG-Tok aligns natural language with motion at the tokenization stage, yielding compact, high-level semantic representations. This not only strengthens both tokenization and detokenization but also simplifies the learning of generative models. Furthermore, existing tokenizers predominantly adopt convolutional architectures, whose local receptive fields struggle to support global language guidance. To this end, we propose a Transformer-based tokenizer that leverages attention mechanisms to enable effective alignment between language and motion. Additionally, we design a language-drop scheme, in which language conditions are randomly removed during training, enabling the detokenizer to support language-free guidance during generation. On the HumanML3D and Motion-X generation benchmarks, LG-Tok achieves Top-1 scores of 0.542 and 0.582, outperforming state-of-the-art methods (MARDM: 0.500 and 0.528), with FID scores of 0.057 and 0.088 versus MARDM's 0.114 and 0.147. LG-Tok-mini uses only half the tokens while maintaining competitive performance (Top-1: 0.521/0.588; FID: 0.085/0.071), validating the efficiency of our semantic representations.

Method Overview

1 📋 LG-Tok Architecture Description

Given motion and text, a frozen text encoder extracts embeddings which are fed into a Transformer-based tokenizer with learnable latent tokens to produce semantic motion tokens. These are quantized into discrete codes for generative modeling. During detokenization, dequantized embeddings and text interact via cross-attention in the detokenizer to reconstruct motion. For generation, tokens from the trained model are decoded to synthesize high-fidelity human motion.
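The core operation above — learnable latent tokens attending over concatenated motion and text features to produce compact semantic tokens — can be illustrated with a minimal NumPy sketch. All sizes (8 latent tokens, 60 frames, 16-dim features) are hypothetical placeholders, not the paper's actual configuration, and projections and multi-head structure are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, context):
    """Latent tokens (queries) attend over motion + text features (keys/values)."""
    d = latents.shape[-1]
    scores = latents @ context.T / np.sqrt(d)   # (num_latents, seq_len)
    return softmax(scores, axis=-1) @ context   # (num_latents, d)

rng = np.random.default_rng(0)
d = 16
latents = rng.normal(size=(8, d))    # learnable latent tokens (hypothetical count)
motion = rng.normal(size=(60, d))    # projected motion frames
text = rng.normal(size=(4, d))       # embeddings from a frozen text encoder

context = np.concatenate([motion, text], axis=0)
semantic_tokens = cross_attend(latents, context)
print(semantic_tokens.shape)  # (8, 16)
```

Because each latent token can attend to every frame and every text embedding, the resulting tokens capture global, language-aligned structure that a local convolution could not.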

LG-Tok Model Architecture
2 🔄 Complete Pipeline

The complete tokenization-generation-detokenization pipeline follows: LG-Tok first tokenizes motion into multi-scale discrete tokens through language-guided encoding and multi-scale quantization. These tokens then enable Scalable AutoRegressive (SAR) modeling, which predicts tokens scale-by-scale rather than one-by-one, significantly improving generation efficiency. Finally, the generated tokens are dequantized and decoded back to motion through our Transformer-based detokenizer with language guidance.
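The multi-scale quantization step can be sketched as residual quantization at progressively finer temporal scales: each scale quantizes the residual left by coarser scales, yielding the scale-by-scale token sequence that SAR predicts. This is a minimal NumPy illustration; the sequence length, scale schedule, and codebook sizes are assumptions for the example, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest(x, codebook):
    # nearest-codeword lookup: squared distances to every codeword
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

def multiscale_quantize(z, codebooks, scales):
    """Coarse-to-fine: each scale encodes the residual of the coarser ones."""
    T, d = z.shape
    residual = z.copy()
    indices = []
    for s, cb in zip(scales, codebooks):
        pooled = residual.reshape(s, T // s, d).mean(axis=1)  # pool to s tokens
        idx, q = nearest(pooled, cb)
        residual = residual - np.repeat(q, T // s, axis=0)    # subtract upsampled code
        indices.append(idx)
    return indices, z - residual  # per-scale token ids, dequantized approximation

T, d = 8, 6
scales = [1, 2, 4, 8]                                # tokens per scale (assumed)
codebooks = [rng.normal(size=(32, d)) for _ in scales]
z = rng.normal(size=(T, d))

ids, z_hat = multiscale_quantize(z, codebooks, scales)
print([len(i) for i in ids])  # [1, 2, 4, 8]
```

A generative model can then predict all tokens of one scale in parallel, conditioned on the coarser scales, instead of predicting tokens one by one.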

Experimental Results

We evaluate our approach on HumanML3D and Motion-X datasets. LG-Tok achieves superior performance with Top-1 R-Precision of 0.542 and FID of 0.057 on HumanML3D, surpassing the previous best method MoSa (0.518 and 0.064). On Motion-X, LG-Tok-mid reaches 0.591 Top-1 R-Precision. Notably, LG-Tok-mini maintains competitive results with only half the tokens (0.521 Top-1 R-Precision), validating the efficiency of our language-guided semantic representations.


Comparisons

a person walks to the left, then to the right, then back to their original position in the middle

StableMoFusion

MARDM

MoSa

LG-Tok (Ours)

The boxer dodges quickly, shifting their weight from side to side

StableMoFusion

MARDM

MoSa

LG-Tok (Ours)

a person walks a few steps forward and then starts dancing as if with a partner and then turns to the right

StableMoFusion

MARDM

MoSa

LG-Tok (Ours)

Gallery of Generation

Apply in Motion Editing

🚧 Coming Soon! 🚧

We provide motion editing results to demonstrate that LG-Tok's benefits extend beyond standard generation to constrained tasks. Following masked-inpainting protocols common in generative models, we pass an edit mask during detokenization so that language guidance is applied only in designated regions. This allows precise temporal control for partial-generation tasks such as motion editing and in-betweening.
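The edit-mask idea can be sketched as a simple per-frame blend: language-guided features replace the original motion features only inside the editable region. The feature tensors and the fusion itself are placeholders for illustration, not the detokenizer's actual cross-attention:

```python
import numpy as np

def apply_edit_mask(motion_feat, guided_feat, edit_mask):
    """Keep original features outside the mask; use guided ones inside it."""
    m = edit_mask[:, None].astype(bool)       # broadcast over feature dim
    return np.where(m, guided_feat, motion_feat)

T, d = 10, 4
motion = np.zeros((T, d))            # features of the motion to preserve
guided = np.ones((T, d))             # hypothetical language-guided features
mask = np.zeros(T)
mask[3:7] = 1                        # edit only frames 3..6

out = apply_edit_mask(motion, guided, mask)
print(out[:, 0])  # [0. 0. 0. 1. 1. 1. 1. 0. 0. 0.]
```

The same mechanism supports in-betweening: masking an interior span lets the model fill it in while the surrounding frames stay fixed.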

BibTeX

@article{yan2026language,
  title={Language-Guided Transformer Tokenizer for Human Motion Generation},
  author={Yan, Sheng and Wang, Yong and Du, Xin and Yuan, Junsong and Liu, Mengyuan},
  journal={arXiv preprint arXiv:2602.08337},
  year={2026}
}