University of Edinburgh | [email protected]

June 29th 2023


Following the great success of ChatGPT, the release of LLaMA on February 24, 2023 heated up research on instruction tuning. On March 18, Alpaca demonstrated that a smaller model distilled from a mature one can become a decent chatbot, triggering a Cambrian explosion of LLaMA-based models. However, just three months later, people began to recognize the various problems of training LLaMA on ChatGPT's data. This article reviews the development of LLaMA-based models over the past three months and discusses the next challenges of instruction tuning.

Disclaimer: This article is essentially a quick research memo, edited from the outline of my recent presentation, with some cuts and additions. There are currently many things about building LLMs that the open-source community is unclear on. I have tried my best to ensure that the content I refer to or discuss has solid evidence, rather than being based on rumors. Much of the content comes from direct discussions with the original authors of the corresponding papers. Even so, my take may still be very wrong and there may be many unresolved issues, so please feel free to comment directly beside the article and participate actively in the discussion. I will keep all the comments that point out my errors. The truth becomes clearer with debate.


1 - The origin

The first papers:

Natural Instructions v1: Cross-task generalization via natural language crowdsourcing instructions
- The very beginning. Initially released in April 2021. Two years ahead of LLaMA. Extremely visionary!!!
FLANv1: Finetuned Language Models Are Zero-Shot Learners
T0: Multitask Prompted Training Enables Zero-Shot Task Generalization
InstructGPT: Training language models to follow instructions with human feedback

Comparisons:

The goal of InstructGPT is alignment, with zero-shot and cross-lingual abilities being side effects.
- This paper paired a 6B reward model with a 175B policy model, a setup later followed by DeepSpeed Chat and a series of subsequent open-source RL works. This approach is probably incorrect.
- The correct approach should be to scale up the reward model so that the policy model can be smaller, as seen in Scaling Laws for Reward Model Overoptimization. That is, roughly reverse the sizes of the two models, for example using a 175B reward model to run PPO on a 7B policy.
- For a model that will be deployed online, 10-50B is a more affordable scale; anything larger would be too costly.
The goals of FLANv1 and T0 are zero-shot, so they are not for alignment.
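To put the model sizes in the InstructGPT discussion above in perspective, here is a rough back-of-the-envelope sketch of the memory needed just to hold the weights. It assumes 16-bit weights (2 bytes per parameter) and ignores activations, the KV cache, and optimizer state, so the real serving cost is higher. The function name `weight_memory_gb` is made up for illustration.

```python
# Assumption: weights are stored in 16-bit floats (2 bytes per parameter).
BYTES_PER_PARAM_FP16 = 2


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Sizes mentioned in the text: a 7B policy, the 10-50B "affordable" range,
    # and a 175B model.
    for label, n in [("7B", 7e9), ("10B", 10e9), ("50B", 50e9), ("175B", 175e9)]:
        print(f"{label:>5}: ~{weight_memory_gb(n):.0f} GB of fp16 weights")
```

Only the policy has to be served to users; the reward model is needed during training only. Under these assumptions, that is why putting the large model on the reward side keeps the deployed model small.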

Then there's Self-instruct:

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Important points about self-instruct:

The base model can be arbitrary (any pretrained checkpoint); it doesn't need to be a model that has undergone alignment (such as ChatGPT).
It reproduces the process of going from the first-generation Davinci to text-Davinci-001. Incredibly insightful!!
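The bootstrapping idea behind Self-Instruct can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline. The `generate` callable is a hypothetical wrapper around any pretrained checkpoint, and the word-set overlap filter is a crude stand-in for the ROUGE-L similarity filter used in the paper. The sketch covers only the instruction-generation loop; generating the instances (inputs and outputs) and the final finetuning step are left out.

```python
import random
from typing import Callable, List, Optional


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (a crude stand-in for ROUGE-L)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def self_instruct(
    seed_tasks: List[str],
    generate: Callable[[str], List[str]],  # hypothetical: prompt -> candidate instructions
    rounds: int = 3,
    num_demos: int = 4,
    max_overlap: float = 0.7,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Grow an instruction pool by prompting a base model with sampled demos."""
    rng = rng or random.Random(0)
    pool = list(seed_tasks)
    for _ in range(rounds):
        # Sample a few existing instructions as in-context demonstrations.
        demos = rng.sample(pool, min(num_demos, len(pool)))
        prompt = "\n".join(f"Task {i + 1}: {d}" for i, d in enumerate(demos))
        prompt += f"\nTask {len(demos) + 1}:"
        for candidate in generate(prompt):
            candidate = candidate.strip()
            # Keep only non-empty candidates that are not near-duplicates of the pool.
            if candidate and all(token_overlap(candidate, t) < max_overlap for t in pool):
                pool.append(candidate)
    return pool
```

Because `generate` is just a function from a prompt to candidate strings, any pretrained checkpoint can be plugged in, which is the point made above: no aligned model is required.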