Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the different effects of SFT and RL on generalization and memorization.
We introduce two evaluation tasks to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both textual rule variations and visual variations. SFT, in contrast, tends to memorize training data and struggles to generalize to unseen rules or out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization. We additionally provide a detailed study on the role of verification in RL training.
We leverage two evaluation tasks to study the generalization capabilities of foundation models. Each task spans two modalities: pure language (-L) and vision-language (-VL), where the latter requires the model to process visual input before solving the task. Both tasks contain rule variants and visual variants to evaluate out-of-distribution generalization.
GeneralPoints. We introduce this task to assess the arithmetic generalization capabilities of foundation models. The goal is to create an equation that equals a target number (24 by default) using all 4 numbers from the cards exactly once.
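To make the task concrete, here is a minimal sketch of how a candidate GeneralPoints solution could be verified; the function name and interface are our own illustration under assumed conventions, not the paper's code.

```python
import re
from collections import Counter

def check_equation(equation: str, cards: list[int], target: int = 24) -> bool:
    """Return True if the equation uses all 4 card numbers exactly once and hits the target."""
    # Only digits, arithmetic operators, parentheses, and spaces are allowed.
    if not re.fullmatch(r"[\d+\-*/() ]+", equation):
        return False
    # Every card number must appear exactly once.
    numbers = [int(n) for n in re.findall(r"\d+", equation)]
    if Counter(numbers) != Counter(cards):
        return False
    try:
        value = eval(equation)  # safe here: the regex above restricts the characters
    except (SyntaxError, ZeroDivisionError):
        return False
    return abs(value - target) < 1e-6

print(check_equation("(5 - 1) * 5 + 4", [5, 5, 4, 1]))  # True
print(check_equation("5 * 5 - 4 + 1", [5, 5, 4, 1]))    # False (evaluates to 22)
```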
V-IRL. We adopt this task to study the spatial reasoning capabilities in an open-world navigation domain. The goal is to navigate to a target location by following a set of instructions that contain spatial information.
For each task, we separately scale the training compute for RL and SFT on a single rule and test out-of-distribution generalization on unseen rules. We observe that RL demonstrates generalization across all settings. In contrast, SFT tends to overfit the training data.
Since VLMs also incorporate a visual modality, we next study the effects of visual variation in OOD generalization. We conduct our experiments on visual variants: for GeneralPoints, we train the VLM using the black suits (♠, ♣) and test out-of-distribution performance on the red suits (♥, ♦). For V-IRL, we train the model on routes collected in New York City and evaluate it on routes from various cities worldwide. We see that RL still significantly outperforms SFT.
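As an illustration of the GP-VL visual split, a hypothetical helper (field and variable names are our own assumptions) might partition samples by suit color as follows.

```python
# Hypothetical helper for the GP-VL visual OOD split: train on black suits only,
# evaluate out-of-distribution on red suits only.
BLACK_SUITS = {"spades", "clubs"}
RED_SUITS = {"hearts", "diamonds"}

def split_by_suit_color(samples):
    train, ood_test = [], []
    for sample in samples:
        suits = set(sample["suits"])  # suits of the 4 cards shown in the image (assumed field)
        if suits <= BLACK_SUITS:
            train.append(sample)
        elif suits <= RED_SUITS:
            ood_test.append(sample)
    return train, ood_test
```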
To further study RL's effect on visual capabilities, we conduct additional ablation studies in the GP-VL environment, investigating the OOD performance of RL and SFT along with the model's visual recognition accuracy, i.e., how accurately it recognizes the 4 cards in the input image. We find that as post-training compute scales up, RL improves both recognition accuracy and overall accuracy, while SFT shows the opposite effect.
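For reference, here is a minimal sketch (our own formulation, not the paper's code) of such a recognition metric, counting a prediction as correct only when all 4 card values are recovered.

```python
from collections import Counter

def recognition_accuracy(predicted_cards, true_cards):
    """Fraction of images whose 4 card values are all recognized correctly."""
    correct = sum(
        Counter(pred) == Counter(truth)
        for pred, truth in zip(predicted_cards, true_cards)
    )
    return correct / len(true_cards)

# Order-insensitive: [2, 7, 10, 7] matches [7, 2, 7, 10].
print(recognition_accuracy([[2, 7, 10, 7]], [[7, 2, 7, 10]]))  # 1.0
```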
We apply sequential verification-revision to reward the model during RL training. We observe that verification amplifies the generalization capabilities of RL: OOD performance grows faster as we increase the number of verification iterations.
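Below is a simplified sketch of how such a verification-revision loop could be wired into the RL rollout; the `model.generate` interface, the feedback string, and the reward values are assumptions for illustration (it reuses the `check_equation` helper sketched above).

```python
def rollout_with_verification(model, prompt, cards, target=24, max_iters=3):
    """Generate, verify with the outcome-based reward, and revise on failure."""
    history = prompt
    equation, reward = "", 0.0
    for _ in range(max_iters):
        equation = model.generate(history)            # assumed generation interface
        if check_equation(equation, cards, target):   # outcome-based verifier
            reward = 1.0
            break
        # On failure, append verifier feedback so the model can revise its answer.
        history += f"\nThe attempt {equation} is incorrect. Please try again."
    return equation, reward
```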
To conclude, we present a comparative study and observe that SFT memorizes while RL generalizes across multiple modalities and task domains. We also note that SFT remains important as a warm-up for RL training, especially in scenarios that demand strong instruction-following capabilities. We hope our study can enhance the understanding of post-training techniques and fuel future research in this area.
We sincerely thank NYU VisionX for the nice project page template.
@misc{chu2025sftmemorizesrlgeneralizes,
title={SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training},
author={Tianzhe Chu and Yuexiang Zhai and Jihan Yang and Shengbang Tong and Saining Xie and Dale Schuurmans and Quoc V. Le and Sergey Levine and Yi Ma},
year={2025},
eprint={2501.17161},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2501.17161},
}