University of Edinburgh | [email protected]

Thanks to Hao Peng and Tushar Khot at AI2 for insightful discussions

Started writing on Apr 30 2023

Released on May 01 2023

Last updated on May 09 2023

Other versions: [pdf] [Arxiv] [中文] [bib]


Recently, many works on smaller models have achieved inspiring dialog abilities, which makes people wonder whether smaller models can perform comparably to large models like GPT-3.5. Language models generally have multi-dimensional abilities, which makes them hard to compare. Finding the right metric is crucial for developing strong language models. At the current stage, the community is eager to know which key differentiators mark the potential of a strong language model.

In the GPT-4 release blog, the authors write: “In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold”. This suggests that complex tasks are likely to be the key differentiators between large and small language models.

More importantly, complex reasoning opens up opportunities for building a wide spectrum of applications on top of language models, effectively making language models the next-generation computation platform / operating system. This has the potential to substantially change the way humans interact with computers and to reshape the whole computational ecosystem.

In this post, we take a close look at methods for building models with strong complex reasoning capabilities.

In astrophotography, when shooting star trails with a long exposure, Polaris, the North Star, sits at the center of the star trails, always pointing to true north. In ancient times, it was the star that guided travelers.


1 - Motivation: LLMs as future-generation computation platform

We study complex reasoning for two reasons:

As mentioned above, complex reasoning is the key differentiator between small and large models, as discussed in the GPT-4 release post.
Complex reasoning is a core ability that makes it possible for the model to become the next-generation operating system.

The vision of making language models the next-generation operating system is particularly interesting because it opens countless possibilities for building new applications and creating a language-model-based computational ecosystem (probably an even larger opportunity than super apps like ChatGPT). Complex reasoning serves as the foundation: if we want the model to become a new OS, it needs to be able to complete complex instructions through interactions with tools, users, and all elements of the outside environment.

This post studies how to train models with strong complex reasoning abilities, how to do prompt engineering to fully unlock the model’s reasoning ability, and how to evaluate the model’s reasoning performance. The post is organized as follows:

In section 2, we discuss existing recipes for building language models with strong complex reasoning abilities. The recipe for complex reasoning is similar to the recipe for generic LLM development, consisting of three stages: continued training, instruction finetuning, and reinforcement learning. We further discuss the intriguing alignment between coding and reasoning.
In section 3, we discuss prompt engineering techniques for complex reasoning. When language models become new-generation operating system kernels, prompt engineering / in-context learning will become new-generation shell scripting.
In section 4, we discuss how to evaluate the reasoning abilities of large language models. We introduce chain-of-thought hub, a suite of 100+ reasoning tasks that clearly marks the differences between large and small models. We highlight the promising performance of LLaMA 65B, which we view as having very strong potential as a base model for reproducing ChatGPT-3.5.