Towards more human-like language models based on contextualizer pretraining strategy

Published in the EMNLP/CoNLL BabyLM Challenge, 2023

Taking inspiration from how human children learn language, we pose a question: can a “baby language model” gradually internalize a concept by encountering it in a virtually unlimited variety of often irrelevant contexts, and what does this imply when pretraining resources (both data and GPU compute) are limited? A toy illustration of this exposure idea follows.
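
The snippet below is a minimal, hypothetical sketch of that exposure idea, not the paper's actual data recipe: it places the same concept-bearing sentence into several randomly sampled, unrelated contexts when assembling fixed-length pretraining sequences. The function name, the `n_views`/`max_len` parameters, and the toy sentence pool are illustrative placeholders.

```python
import random

def build_contextualized_sequences(concept_sentences, context_pool,
                                   n_views=4, max_len=128, seed=0):
    """For each concept sentence, emit several training sequences that surround
    it with different, unrelated context sentences (hypothetical sketch)."""
    rng = random.Random(seed)
    sequences = []
    for concept in concept_sentences:
        for _ in range(n_views):
            seq = [concept]
            # Keep appending unrelated sentences until the word-length budget is hit.
            while sum(len(s.split()) for s in seq) < max_len:
                seq.append(rng.choice(context_pool))
            rng.shuffle(seq)  # the concept appears at varying positions
            sequences.append(" ".join(seq))
    return sequences

# Toy usage: one concept sentence seen in four different surrounding contexts.
pool = ["The train was late again.", "She painted the fence green.",
        "Rain fell for most of the afternoon.", "He fixed the squeaky door."]
for s in build_contextualized_sequences(["A penguin is a flightless bird."], pool, max_len=20):
    print(s)
```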

Throughout the study, we restrict our experiments to two data-limited settings, 10M and 100M tokens, roughly 1/3000 and 1/300 of the data available for training RoBERTa. Our best-performing training recipe comes within 1.2% of RoBERTa and is on par with BERT on the BLiMP zero-shot linguistic knowledge benchmark, while using only 1/300 of RoBERTa’s pretraining data and training for a single epoch on one GPU in four days.
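
For context, the sketch below shows how BLiMP zero-shot scoring is commonly done with a masked language model (it is not necessarily the paper's exact evaluation code): each sentence in a minimal pair is scored by its pseudo-log-likelihood, obtained by masking one token at a time and summing the log probabilities of the true tokens, and the pair counts as correct when the grammatical sentence scores higher. The model name and example pair are placeholders.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "roberta-base"  # placeholder; the paper trains its own small models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

@torch.no_grad()
def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log p(token | rest of sentence) with each token masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip the special tokens at the start and end of the sequence.
    for i in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

good = "The cats annoy Tim."   # grammatical member of a BLiMP-style minimal pair
bad = "The cats annoys Tim."   # ungrammatical member
print("pair scored correctly:", pseudo_log_likelihood(good) > pseudo_log_likelihood(bad))
```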