Youhe Jiang
PhD student in Computer Science at the University of Cambridge, advised by Dr. Eiko Yoneki. I build systems that make modern AI workloads more efficient at scale — spanning LLM serving, heterogeneous and decentralized systems, distributed training, and communication-aware optimisation.
Publications
Conference Papers
- OSDI 2026 · LLMFabric: Unifying Decentralized HPC Clusters for Heterogeneous LLM Serving
- MLSys 2026 · BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization
- ICLR 2026 · Cascadia: A Cascade Serving System for Large Language Models
- ICDE 2026 · Hexgen-Text2SQL: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL Workflow
- ICLR 2026 · FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel
- MLSys 2026 · HexiScale: Accommodating Large Language Model Training over Heterogeneous Environment
- NeurIPS 2025 · Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on 9600+ GPUs
- ICML 2025 · Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
- MLSys 2025 · ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
- ICLR 2025 · HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment
- ICML 2024 · HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment
- TKDE 2024 · Improving Automatic Parallel Training via Balanced Memory Workload Optimization
- VLDB 2023 · Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
- IJCAI 2023 · OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
- IEEE Access 2020 · 2D-HRA: Two-Dimensional Hierarchical Ring-Based All-Reduce Algorithm in Large-Scale Distributed ML
Preprints
- arXiv 2026 · Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
- arXiv 2026 · OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
- arXiv 2026 · Efficient Multi-round LLM Inference over Disaggregated Serving
- arXiv 2026 · SLA2: Sparse-Linear Attention with Learnable Routing and QAT
- arXiv 2025 · Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately
- arXiv 2025 · AReaL-Hex: Accommodating Asynchronous RL Training over Heterogeneous GPUs
- arXiv 2025 · Parallax: Efficient LLM Inference Service over Decentralized Environment
- arXiv 2025 · Efficient Mixed-Precision Large Language Model Inference with TurboMind
Experience
University of Cambridge
PhD in Computer Science · Advisor: Dr. Eiko Yoneki
HKUST
Research Assistant · Advisor: Dr. Binhang Yuan
Peking University
Research Assistant · Advisor: Dr. Bin Cui
Tsinghua University · Kuaishou · Baidu
R&D Intern · Distributed training systems and deep learning infrastructure
Xidian University
BEng in Telecommunication Engineering · Advisor: Dr. Huaxi Gu
Open Source
Parallax
Decentralized LLM serving framework for efficient inference across distributed, heterogeneous environments.
FSA
Efficient implementation of the native sparse attention kernel for faster execution of sparse-attention models.
Galvatron
Automatic distributed training system for large transformers with multi-GPU parallelism.
AReaL
Large-scale asynchronous reinforcement learning system.
Hetu
High-performance distributed deep learning system for large-scale training.
LMDeploy
Toolkit for compressing, deploying, and serving LLMs with mixed-precision inference.
PaddlePaddle
Industrial deep learning platform covering model development, training, and deployment infrastructure.
Bagua
PyTorch training acceleration framework for efficient large-scale distributed learning.