Yao Fu | Website | Blog

University of Edinburgh | [email protected]

Started writing on Aug 20, 2023

Released on Sep 05, 2023

Thanks to Ziming Liu @MIT, Junxian He @HKUST, Hao Peng @UIUC, and Ruibo Liu @Google DeepMind for insightful discussions


Recently, the focus of research and the open-source community has been gradually shifting from model engineering to data engineering, as people realize the crucial importance of data quality. However, when we say "we want better data", what does "better data" mean precisely? When we say "we optimize the data composition", what is the objective that we are optimizing?

We would like to study the theoretical foundations of data engineering for language modeling. We believe that an in-depth understanding of the problem is as important as developing methods to solve it, and theoretical analysis will lead us to predictable scaling: predicting the eventual performance on every single task before we actually run the experiments.

In this post, we aggregate recent insights about data engineering and give problem formulations for data optimization. That is to say, we do NOT propose specific methods for optimizing data; rather, we discuss what problem we should be solving when optimizing data, and the underlying principles that guide us. Specifically, we discuss the following objectives for pretraining and SFT data optimization:

For pretraining data optimization:

<aside> 💡 Find the optimal mix ratio + data format + data curriculum such that the speed of learning is maximized.

</aside>

For supervised finetuning / instruction tuning data optimization:

<aside> 💡 Find the minimal combination of query-response pairs such that their coverage of the user preference distribution is maximized.

</aside>

Combining the insights from pretraining and finetuning, we further look for ways to connect the two to achieve predictable scaling: that is, given the pretrain and finetune data, the model architecture, and the training hyperparameters, we would like to predict all the results before running the experiments. Specifically,

For predictable scaling:

<aside> 💡 Find the scaling law that, given the pretrain data mixture + model scale + finetune data mixture + training algorithm as input, predicts the pretrain dev loss, downstream performance, and human preference before any experiments are run.

</aside>
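To make the predictable-scaling objective concrete, here is a minimal sketch of one common approach (my illustration, not a method from this post): fit a power law L(N) = a·N^(-b) + c to dev losses observed at small model scales, then extrapolate to a larger scale. The functional form and the grid-search fitting procedure are assumptions for illustration only.

```python
# Hypothetical sketch: fit L(N) = a * N^(-b) + c to dev losses measured
# at small scales, then extrapolate the dev loss at a larger scale.
def fit_power_law(ns, losses):
    """Grid-search the exponent b; solve a, c by linear least squares."""
    best = None
    for i in range(1, 41):
        b = i * 0.05
        xs = [n ** (-b) for n in ns]
        # Given b, L = a*x + c is linear in (a, c): closed-form least squares.
        mx = sum(xs) / len(xs)
        my = sum(losses) / len(losses)
        a = sum((x - mx) * (y - my) for x, y in zip(xs, losses)) / \
            sum((x - mx) ** 2 for x in xs)
        c = my - a * mx
        err = sum((a * x + c - y) ** 2 for x, y in zip(xs, losses))
        if best is None or err < best[0]:
            best = (err, a, b, c)
    return best[1], best[2], best[3]

# Synthetic "observed" dev losses at small model scales (illustrative numbers).
ns = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * n ** -0.5 + 1.8 for n in ns]

a, b, c = fit_power_law(ns, losses)
predicted = a * (1e10) ** (-b) + c  # extrapolated dev loss at the next scale
```

The real problem in the post is harder: the input also includes the data mixture and the training algorithm, not just the model scale.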

Table of Contents

1 - The problem of pretrain data optimization

We first clarify that although the final next-word prediction loss is typically used to measure pretraining, it is not the objective when we optimize the pretrain data mixture. We are not directly optimizing the final loss; we are optimizing the speed of loss decrease — we would like the derivative of the learning curve to be larger in magnitude, i.e., a steeper curve. To measure the speed of learning, instead of the next-word prediction loss (which is quite informative given its connection to lossless compression), we consider the speed of grokking — how fast the model can acquire a specific skill — as a potentially good alternative metric. We will circle back to the connection between grokking and the next-word prediction loss later.
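The "speed of loss decrease" view can be sketched in a few lines (a toy illustration of mine, with made-up loss curves, not the post's method): approximate the derivative of each learning curve with finite differences, and prefer the data mixture whose curve has the steeper downward slope.

```python
# Toy sketch: compare two data mixtures by the speed of loss decrease,
# i.e. the average finite-difference slope of their learning curves.
def learning_speed(steps, losses):
    """Average slope of the loss curve; more negative means faster learning."""
    slopes = [
        (losses[i + 1] - losses[i]) / (steps[i + 1] - steps[i])
        for i in range(len(steps) - 1)
    ]
    return sum(slopes) / len(slopes)

steps = [0, 100, 200, 300]
mix_a = [4.0, 3.0, 2.5, 2.2]  # steeper curve
mix_b = [4.0, 3.7, 3.5, 3.4]  # shallower curve

# The mixture with the more negative slope is learning faster.
faster = "mix_a" if learning_speed(steps, mix_a) < learning_speed(steps, mix_b) else "mix_b"
```

Both runs here end before convergence, which is the point: we rank mixtures by how steeply the curve falls, not by where it would eventually bottom out.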

2 - Using the speed of grokking as measurement

2.1 - What is grokking

Grokking means that at a certain step of training, the model suddenly learns a skill and transitions from memorization to generalization.

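One hedged way to operationalize the "speed of grokking" (an assumption for illustration; the post has not yet committed to a specific metric): record held-out accuracy over training and take the first step at which it crosses a threshold, marking the memorization-to-generalization jump.

```python
# Illustrative sketch: quantify grokking speed as the first training step
# where held-out (validation) accuracy crosses a threshold.
def grokking_step(steps, val_accuracy, threshold=0.9):
    """Return the first step where validation accuracy reaches threshold."""
    for step, acc in zip(steps, val_accuracy):
        if acc >= threshold:
            return step
    return None  # the skill was never grokked within this run

steps = list(range(0, 6000, 1000))
val_acc = [0.05, 0.06, 0.07, 0.10, 0.95, 0.97]  # sudden jump: grokking
earlier_is_faster = grokking_step(steps, val_acc)
```

Under this definition, a better data mixture is one for which `grokking_step` is smaller for the skills we care about.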