Home | Twitter / X | Google Scholar | Semantic Scholar | Hugging Face | GitHub | About Yao Fu

I am a research scientist at Google DeepMind.

I did my Ph.D. at the University of Edinburgh (2020-2024) with Professor Mirella Lapata. I finished my M.S. at Columbia University (2018-2020) with Professor John Cunningham and my B.S. at Peking University (2013-2018) with Professor Yansong Feng. Before my Ph.D., I had a great time visiting Professor Alexander Rush at Cornell Tech (2019-2020).

During my Ph.D., I developed methods for complex reasoning, such as complexity-based prompting and CoT specialization, and for self-play multi-agent debate, such as GPT-Bargaining. My blog discussed the connection between code and reasoning in the early days of this line of work. I also studied long-context continual pretraining and efficient deployment recipes, and identified retrieval heads that mechanistically explain long-context factuality.

I am interested in large-scale generative models for human intelligence. My research objective is to make large multimodal models the next-generation computational platform and generally capable agents. I am broadly interested in scaling, long context, multimodality, reasoning, and efficiency.


Featured Research

arXiv 2024 | Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis [paper][Twitter/X]

arXiv 2024 | Retrieval Head Mechanistically Explains Long-Context Factuality [code][paper][Twitter/X]

ICML 2024 | Data Engineering for Scaling Language Models to 128K Context [code][paper][Twitter/X]