Yao Fu, [email protected]

University of Edinburgh

with Tushar Khot and Hao Peng, work done during an internship at

Allen Institute for AI

Thanks to Aristo teammates, Jingfeng Yang, and Yi Tay for the insightful discussions and suggestions.

Please also check the related blog posts from the CoT team.

Sun Nov 20, 2022

Other versions: [pdf] [google docs] [arxiv] [Chinese]


Recently, there has been great interest and progress in demonstrating impressive abilities of large language models, such as chain-of-thought reasoning and scratchpad prompting. Collectively referred to as emergent abilities of large language models, these are abilities that likely exist only in large models but not in smaller ones, hence the "emergence" framing. Many of these abilities are quite impressive, like complex reasoning, reasoning with knowledge, and out-of-distribution robustness, as we will look at closely below. These abilities are potentially close to what the NLP community has sought for decades, and thus represent a potential research paradigm shift from fine-tuning small models to in-context learning with large models. For pioneers, the paradigm shift may be self-evident and need no justification. Yet, for scientific rigor, we do need very explicit reasons why one should shift to large language models, which are expensive, hard to access, and potentially not as good. In this post, we will scrutinize what these abilities are, what large language models may deliver, and what their potential advantages are in a broader NLP/ML context.
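To make the in-context learning setup concrete, here is a minimal sketch of chain-of-thought prompting in the style of Wei et al. 2022: a few-shot exemplar whose answer spells out intermediate reasoning steps, followed by a new question for the model to continue. The tennis-ball exemplar and the cafeteria question are taken from that paper; the OpenAI completions call and the model name `text-davinci-002` are illustrative assumptions, not something this post prescribes, and any sufficiently large model endpoint would serve.

```python
# A minimal sketch of chain-of-thought prompting (assumes `pip install openai`
# and that openai.api_key is set; model name is illustrative).
import openai

# Few-shot exemplar from Wei et al. 2022: the answer spells out intermediate
# reasoning steps, which is what distinguishes chain-of-thought prompting
# from standard answer-only prompting.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

def answer_with_cot(question: str) -> str:
    """Prepend the chain-of-thought exemplar and let the model continue."""
    prompt = COT_EXEMPLAR + f"Q: {question}\nA:"
    response = openai.Completion.create(
        model="text-davinci-002",  # illustrative; any large enough model
        prompt=prompt,
        max_tokens=256,
        temperature=0.0,
    )
    return response["choices"][0]["text"].strip()

if __name__ == "__main__":
    print(answer_with_cot(
        "A cafeteria had 23 apples. They used 20 for lunch and bought "
        "6 more. How many apples do they have now?"
    ))
```

Note that nothing is fine-tuned here: the "training signal" is entirely in the prompt, which is exactly the paradigm shift discussed above.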


Prerequisites: we assume the readers have the following knowledge:

The emergent abilities that exist in large models but not in small models.

Figure copied from Wei et al. 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. The x-axis is model scale; GSM8K is a dataset of primary-school-level math problems.

In the performance figure above, we observe that the model's performance: