Is the scaling law not working?

Looking at the LLMs released over the past few months, the field is clearly pursuing models with ever-larger parameter counts. This runs counter to the traditional machine-learning principle of ‘parsimony’: use the minimum number of parameters necessary to fit the model, thereby preventing overfitting on unseen data. Yet in both computer vision and natural language processing, generative models with more parameters tend to outperform their more parsimonious counterparts. There are two main hypotheses for why: 1) more training data improves LLM performance, and a huge dataset in turn demands a large number of parameters, otherwise the model underfits; and 2) a large number of parameters lets the model generate higher-resolution images, give more detailed answers, and solve more intricate problems.

The paper “Scaling Laws for Neural Language Models”, published by OpenAI in 2020, introduced the concept of a scaling law: the performance of a neural language model is strongly correlated with the size of the network, the size of the dataset, and the amount of compute, but only weakly correlated with architectural details such as depth and width. There has been a debate over how to apply this scaling law. OpenAI argued that parameter count matters more: for every 10x increase in compute, dataset size should grow by about 1.8x and parameter count by about 5.5x. DeepMind (in the Chinchilla work) argued that the two are equally important: for every 10x increase in compute, dataset size and parameter count should each grow by about 3.16x.
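
To make the two prescriptions concrete, here is a small Python sketch of how a 10x compute increase would be split under each rule. The exponents are back-derived from the rounded multipliers quoted above, not taken from the papers' fitted values, so treat this as illustrative arithmetic only:

```python
import math

def scale_allocation(compute_multiplier, param_exp, data_exp):
    """Split a compute increase into parameter-count and dataset-size growth.

    Exponents are back-derived from the multipliers quoted above
    (e.g. 5.5x params and 1.8x data per 10x compute for OpenAI's rule),
    so they are illustrative, not the papers' exact fitted values.
    """
    return compute_multiplier ** param_exp, compute_multiplier ** data_exp

# OpenAI (Kaplan et al., 2020): parameters grow much faster than data.
openai_param_exp = math.log10(5.5)   # ~0.74
openai_data_exp = math.log10(1.8)    # ~0.26

# DeepMind (Chinchilla): parameters and data grow at the same rate.
chinchilla_exp = 0.5                 # 10 ** 0.5 ~= 3.16

rules = {
    "OpenAI":   (openai_param_exp, openai_data_exp),
    "DeepMind": (chinchilla_exp, chinchilla_exp),
}

for name, (p_exp, d_exp) in rules.items():
    params_x, data_x = scale_allocation(10, p_exp, d_exp)
    print(f"{name}: 10x compute -> {params_x:.1f}x params, {data_x:.1f}x data")
```

Running it simply reproduces the quoted ratios (5.5x/1.8x for OpenAI, 3.2x/3.2x for DeepMind), which makes the difference between the two allocation rules easy to see at a glance.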

The paper “Bigger is not Always Better: Scaling Properties of Latent Diffusion Models”, published by Google this month, investigates how the scaling law plays out for latent diffusion models (LDMs) specifically. It points out that “when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results”. In other words, for the same inference compute, a smaller model run for many sampling steps can outperform a larger model run for only a few. The trade-off exists because total inference compute is roughly the number of sampling (denoising) steps times the cost per step in GFLOPs (one forward pass through the denoiser), and that per-step cost grows with model size. At an equal number of sampling steps, however, the larger model still performs better. This finding opens up new pathways for improving LDMs' generative capability within a limited inference budget.
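
A rough sketch of that budget arithmetic is below. The per-step GFLOPs numbers and model sizes are made up for illustration, not measurements from the paper; the point is only that a fixed inference budget buys a smaller model many more denoising steps:

```python
# Hypothetical per-sampling-step costs for two LDM denoisers (made-up numbers,
# not the paper's measurements): cost per step grows with parameter count.
models = {
    "small_ldm": {"params_m": 100, "gflops_per_step": 50},
    "large_ldm": {"params_m": 800, "gflops_per_step": 400},
}

inference_budget_gflops = 20_000  # fixed compute budget per generated image

for name, m in models.items():
    # Under a fixed inference budget, the affordable number of sampling
    # (denoising) steps is budget / cost-per-step.
    steps = inference_budget_gflops // m["gflops_per_step"]
    print(f"{name}: {m['params_m']}M params -> {steps} sampling steps "
          f"within {inference_budget_gflops} GFLOPs")

# small_ldm: 100M params -> 400 sampling steps within 20000 GFLOPs
# large_ldm: 800M params -> 50 sampling steps within 20000 GFLOPs
```

Whether the extra steps of the smaller model outweigh the larger model's better per-step quality is exactly the trade-off the paper studies.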

References:

  1. Scaling Laws for Neural Language Models [Link]

  2. Bigger is not Always Better: Scaling Properties of Latent Diffusion Models [Link]