HST-SLR: Hierarchical Sub-action Tree for Continuous Sign Language Recognition

Dejie Yang1, Zhu Xu1, Xinjie Gao1, Yang Liu1,2*

1Wangxuan Institute of Computer Technology, Peking University
2State Key Laboratory of General Artificial Intelligence, BIGAI

ICME2025

*Corresponding Author

Sign language videos display a hierarchy of semantic information, from high-level events (glosses) to fine-grained sub-actions. However, existing datasets lack detailed annotations for these sub-actions, posing challenges for improving the layered understanding of sign language content. To address this, we leverage large language models (LLMs) to generate precise and meaningful descriptions of sub-actions, thereby enhancing the hierarchical understanding of sign language videos.

Abstract

Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR due to insufficient training data. To address this, some works have developed cross-modal solutions to align the visual and textual modalities. However, they typically extract textual features from glosses without fully exploiting the knowledge the glosses carry. In this paper, we propose a method built on a Hierarchical Sub-action Tree (HST), termed HST-CSLR, to efficiently combine gloss knowledge with visual representation learning. By incorporating gloss-specific knowledge from large language models, our approach leverages textual information more effectively. Specifically, we construct an HST to represent the textual information, aligning the visual and textual modalities step by step and benefiting from the tree structure to reduce computational complexity. Additionally, we impose a contrastive alignment enhancement to bridge the gap between the two modalities. Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of our HST-CSLR.
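The contrastive alignment enhancement mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a standard symmetric InfoNCE-style objective over paired visual and gloss-text embeddings, with the batch size, feature dimension, and temperature chosen arbitrarily for the example.

```python
import numpy as np

def contrastive_alignment_loss(visual, textual, temperature=0.07):
    """Symmetric InfoNCE-style loss: pull each visual feature toward its
    matching gloss-text feature and push it away from the other pairs.
    visual, textual: (N, D) arrays where row i of each is a matched pair."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = textual / np.linalg.norm(textual, axis=1, keepdims=True)
    logits = v @ t.T / temperature              # (N, N) cosine similarities

    def cross_entropy_diag(lg):
        # log-softmax over each row; the correct match sits on the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    # average the video-to-text and text-to-video directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Perfectly matched pairs drive the loss toward zero, while mismatched pairs inflate it, which is the gap-bridging behavior the abstract refers to.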

Framework


Our proposed HST-CSLR framework. To explore fine-grained sub-action information, we use an LLM to generate detailed descriptions for each gloss and construct a Hierarchical Sub-action Tree (HST). An optimal path search algorithm integrates semantic and temporal information in the sub-action sequence, and a hierarchical cross-modal alignment enhances visual-textual consistency using the activated tree nodes.
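To make the two components concrete, here is a toy sketch, not the paper's implementation: a tree node type with LLM-generated sub-action descriptions as leaves, and a monotonic dynamic-programming search (assumed here; the paper's exact search may differ) that assigns each frame to a sub-action in temporal order while maximizing total frame-to-leaf similarity.

```python
from dataclasses import dataclass, field

@dataclass
class HSTNode:
    """One node of a hierarchical sub-action tree: a gloss at an inner
    node, finer-grained sub-action descriptions as its leaf children."""
    text: str
    children: list = field(default_factory=list)

def leaves(node):
    """Ordered leaf sub-actions under a node (left-to-right traversal)."""
    if not node.children:
        return [node]
    out = []
    for child in node.children:
        out.extend(leaves(child))
    return out

def optimal_path(sim):
    """Monotonic frame-to-sub-action alignment by dynamic programming.
    sim[t][j] = similarity of frame t to leaf sub-action j. Each frame is
    assigned one sub-action; the index may only stay or advance by one,
    starting at the first sub-action and ending at the last."""
    T, J = len(sim), len(sim[0])
    NEG = float("-inf")
    dp = [[NEG] * J for _ in range(T)]
    back = [[0] * J for _ in range(T)]
    dp[0][0] = sim[0][0]                       # path must start at leaf 0
    for t in range(1, T):
        for j in range(J):
            stay = dp[t - 1][j]
            advance = dp[t - 1][j - 1] if j > 0 else NEG
            dp[t][j] = max(stay, advance) + sim[t][j]
            back[t][j] = j if stay >= advance else j - 1
    path = [J - 1]                             # path must end at the last leaf
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

The tree keeps the search local: frames are only compared against the leaves under the currently relevant glosses rather than against every description, which is the complexity reduction the tree structure provides.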

Comparisons with SOTA

  • Experiments on four datasets: PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture
  • Generalization across German, Chinese, and English sign languages
  • Effective on both video and image inputs


Visualization

  • The consistency between our predictions and the ground truth demonstrates the effectiveness of our method.


BibTeX

@inproceedings{HST-SLR,
  title     = {HST-SLR: Hierarchical Sub-action Tree for Continuous Sign Language Recognition},
  author    = {Yang, Dejie and Xu, Zhu and Gao, Xinjie and Liu, Yang},
  booktitle = {IEEE International Conference on Multimedia and Expo ({ICME})},
  publisher = {IEEE},
  year      = {2025},
}