Aligning Language Models Using Multi-Objective Deep Reinforcement Learning
Abstract

Large Language Models (LLMs) are a significant landmark in the advancement of Artificial Intelligence (AI). Aligning LLMs to be helpful and harmless is a central topic in Natural Language Processing (NLP). One of the dominant alignment techniques is reinforcement learning from human feedback (RLHF), which optimizes a single objective derived from human preferences. However, high-quality human feedback is expensive to collect, and it is difficult to keep human annotators consistent in their judgments of desirable behavior. LLM alignment is intrinsically a multi-objective optimization task, since the goal is to train models that are both helpful and harmless. Helpfulness and harmlessness can trade off against each other, making it difficult for a model optimized toward a single objective to perform well on both. To address the potentially conflicting or dominating learning signals underlying LLM alignment, this work proposes a multi-objective deep reinforcement learning (MODRL) methodology. The MODRL algorithm combines an adapted Advantage-Induced Policy Alignment (APA) deep reinforcement learning algorithm with the Aligned-MTL approach to multi-task learning. Judged jointly on helpfulness and harmlessness, language models trained via MODRL outperform those trained with single-objective deep reinforcement learning methods that consider both objectives.
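The abstract names the two building blocks (an APA-style policy loss per objective, and Aligned-MTL-style gradient combination across objectives) without reproducing them. A minimal PyTorch sketch of how such a combination could look is given below. It is not the thesis's implementation: the APA loss follows the published squared-error form (log-ratio of the policy to the reference policy pushed toward the scaled advantage), and `aligned_mtl_combine` orthonormalizes the stacked per-objective gradients via the Gram-matrix eigendecomposition and rescales them by the smallest singular value, as in Aligned-MTL. All names in the usage portion (`policy`, `optimizer`, `logp`, `logp_ref`, `adv_help`, `adv_harm`) are hypothetical placeholders.

```python
import torch

def apa_loss(logp_policy, logp_init, advantage, lam=1.0):
    """APA-style squared-error objective: push the log-ratio of the
    current policy to the initial (reference) policy toward the
    advantage estimate scaled by 1/lam."""
    return ((logp_policy - logp_init - advantage / lam) ** 2).mean()

def aligned_mtl_combine(grads, weights=None, eps=1e-8):
    """Aligned-MTL-style gradient combination (a sketch).

    grads: (T, P) tensor, one flattened gradient per objective.
    Orthonormalizes the task-gradient system via the eigendecomposition
    of the Gram matrix G G^T and rescales by the smallest singular
    value, then returns the weighted combination as a (P,) direction.
    """
    T = grads.shape[0]
    if weights is None:
        weights = torch.ones(T, device=grads.device) / T
    M = grads @ grads.T                       # (T, T) Gram matrix
    lam, V = torch.linalg.eigh(M)             # eigenvalues, ascending
    sigma = lam.clamp_min(eps).sqrt()         # singular values of G
    sigma_min = sigma[0]
    # B G = sigma_min * U W^T: rows become orthogonal with equal norm.
    B = V @ torch.diag(sigma_min / sigma) @ V.T
    aligned = B @ grads                       # (T, P) aligned gradients
    return weights @ aligned                  # (P,) combined direction

# Hypothetical training step: one APA loss per alignment objective
# (helpfulness and harmlessness), combined via Aligned-MTL.
def modrl_step(policy, optimizer, logp, logp_ref, adv_help, adv_harm, lam=1.0):
    losses = [apa_loss(logp, logp_ref, adv_help, lam),
              apa_loss(logp, logp_ref, adv_harm, lam)]
    params = [p for p in policy.parameters() if p.requires_grad]
    flat_grads = torch.stack([
        torch.cat([g.reshape(-1) for g in
                   torch.autograd.grad(L, params, retain_graph=True)])
        for L in losses])
    update = aligned_mtl_combine(flat_grads)
    # Scatter the combined direction back into .grad and step.
    offset = 0
    for p in params:
        n = p.numel()
        p.grad = update[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
    optimizer.zero_grad()
```

The intended effect of the alignment step is that neither objective's gradient dominates the update: the per-objective components of the combined direction are rescaled to equal norm before weighting, which is one way to mitigate the conflicting or dominating learning signals the abstract describes.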