In this paper, we focus on discrete motion tokenization, which converts raw motion into compact discrete tokens, a process proven crucial for efficient motion generation. In this paradigm, increasing the number of tokens is a common way to improve motion reconstruction quality, but more tokens make the sequence harder for generative models to learn. To maintain high reconstruction quality while reducing generation complexity, we propose leveraging language to achieve efficient motion tokenization, which we term Language-Guided Tokenization (LG-Tok). LG-Tok aligns natural language with motion at the tokenization stage, yielding compact, high-level semantic representations. This approach not only strengthens both tokenization and detokenization but also simplifies the learning of generative models. Furthermore, existing tokenizers predominantly adopt convolutional architectures, whose local receptive fields struggle to support global language guidance. To this end, we propose a Transformer-based tokenizer that leverages attention mechanisms to enable effective alignment between language and motion. Additionally, we design a language-drop scheme in which language conditions are randomly removed during training, enabling the detokenizer to support language-free guidance during generation. On the HumanML3D and Motion-X generation benchmarks, LG-Tok achieves Top-1 scores of 0.542 and 0.582, outperforming state-of-the-art methods (MARDM: 0.500 and 0.528), with FID scores of 0.057 and 0.088 versus 0.114 and 0.147, respectively. LG-Tok-mini uses only half the tokens while maintaining competitive performance (Top-1: 0.521/0.588; FID: 0.085/0.071), validating the efficiency of our semantic representations.
Given motion and text, a frozen text encoder extracts embeddings, which are fed together with learnable latent tokens into a Transformer-based tokenizer to produce semantic motion tokens. These are quantized into discrete codes for generative modeling. During detokenization, dequantized embeddings and text interact via cross-attention in the detokenizer to reconstruct the motion. For generation, tokens sampled from the trained model are decoded to synthesize high-fidelity human motion.
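The tokenization path above can be sketched as follows. This is a toy NumPy illustration, not the paper's implementation: the shapes, the single-head attention, and the codebook size are all assumptions made for clarity. Learnable latent tokens attend over the concatenated motion and text features, and each resulting semantic token is snapped to its nearest codebook entry to yield a discrete code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention: queries attend to keys_values.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

# Hypothetical sizes: 16 motion frames, 5 text tokens, 4 latent tokens, dim 8.
motion = rng.normal(size=(16, 8))    # per-frame motion features
text_emb = rng.normal(size=(5, 8))   # frozen text-encoder embeddings
latents = rng.normal(size=(4, 8))    # learnable latent tokens

# Latent tokens gather global context from motion and text jointly.
context = np.concatenate([motion, text_emb], axis=0)
semantic_tokens = cross_attention(latents, context)   # (4, 8)

# Quantize each semantic token to its nearest codebook entry.
codebook = rng.normal(size=(32, 8))
dists = ((semantic_tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = dists.argmin(axis=1)   # one discrete code per latent token
```

Because the latent tokens attend over the entire sequence, each discrete code can summarize global, language-aligned structure rather than a local temporal window.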
The complete tokenization-generation-detokenization pipeline follows: LG-Tok first tokenizes motion into multi-scale discrete tokens through language-guided encoding and multi-scale quantization. These tokens then enable Scalable AutoRegressive (SAR) modeling, which predicts tokens scale-by-scale rather than one-by-one, significantly improving generation efficiency. Finally, the generated tokens are dequantized and decoded back to motion through our Transformer-based detokenizer with language guidance.
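The multi-scale quantization step can be illustrated with a toy residual-quantization loop, a common realization of multi-scale discrete codes (the codebook sizes, number of scales, and dimensions here are assumptions, not the paper's configuration). Each scale quantizes whatever residual the coarser scales left behind, so the SAR model can predict one whole scale per step instead of one token per step.

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize(x, codebook):
    # Nearest-neighbour assignment of each row of x to a codebook entry.
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

# Hypothetical setup: 8 continuous motion latents of dim 4, 3 scales.
latent = rng.normal(size=(8, 4))
codebooks = [rng.normal(size=(64, 4)) for _ in range(3)]

residual = latent
recon = np.zeros_like(latent)
all_codes = []
for cb in codebooks:
    q, idx = quantize(residual, cb)
    recon = recon + q            # coarse-to-fine reconstruction accumulates
    residual = residual - q      # the next scale quantizes what is left
    all_codes.append(idx)

# Invariant of residual quantization: accumulated reconstruction plus the
# remaining residual always recovers the original latent.
```

A scale-by-scale autoregressive model then needs only one prediction step per scale (3 here) rather than one step per token, which is the efficiency gain the pipeline description refers to.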
We evaluate our approach on HumanML3D and Motion-X datasets. LG-Tok achieves superior performance with Top-1 R-Precision of 0.542 and FID of 0.057 on HumanML3D, surpassing the previous best method MoSa (0.518 and 0.064). On Motion-X, LG-Tok-mid reaches 0.591 Top-1 R-Precision. Notably, LG-Tok-mini maintains competitive results with only half the tokens (0.521 Top-1 R-Precision), validating the efficiency of our language-guided semantic representations.
Qualitative comparisons (methods shown for each example): StableMoFusion, MARDM, MoSa, LG-Tok (Ours)
a man turns to his left, then goes down onto a crawl, moving around on his hands and feet
a person quickly runs forward and stoops to pick up an object before carrying it off
a person takes one step forward with their right foot and then with their left to end with their feet side by side
the person is balancing on one leg using his hands to help balance
🚧 Coming Soon! 🚧
We provide motion editing results to demonstrate that LG-Tok's benefits extend beyond standard generation to constrained tasks. Following protocols in generative models, we pass an edit mask during detokenization to enable language guidance only in designated regions. This allows precise temporal control for partial generation tasks such as motion editing and in-betweening.
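The edit-mask mechanism can be sketched as a per-frame blend between a text-guided pass and a language-free pass of the detokenizer (the latter made possible by the language-drop training). The feature shapes and the edited frame range below are hypothetical, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for per-frame detokenizer features from two passes:
# one with text conditioning, one language-free.
frames = 12
feat_text = rng.normal(size=(frames, 6))   # text-guided features
feat_free = rng.normal(size=(frames, 6))   # language-free features

# Edit mask: language guidance is active only on frames 4-7,
# the designated region to be edited.
mask = np.zeros((frames, 1))
mask[4:8] = 1.0

# Inside the mask, frames follow the text; outside, they stay unconditioned.
blended = mask * feat_text + (1 - mask) * feat_free
```

Restricting guidance to the masked region gives the temporal control needed for partial-generation tasks such as editing a clip's middle segment or in-betweening between fixed endpoints.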
@article{yan2026language,
  title={Language-Guided Transformer Tokenizer for Human Motion Generation},
  author={Yan, Sheng and Wang, Yong and Du, Xin and Yuan, Junsong and Liu, Mengyuan},
  journal={arXiv preprint arXiv:2602.08337},
  year={2026}
}