Finetune LLMs directly on preference data

Instead of using RL to finetune an LLM, you can train it directly on paired preference data, where each prompt comes with a preferred response and a rejected one.
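The note does not name a specific method, so the following is only a minimal sketch of one well-known way to train on preference pairs without RL, a DPO-style loss. It assumes you already have the summed log-probabilities of the preferred and rejected responses under both the model being trained and a frozen reference model. The function names, argument names, and the `beta` value are illustrative assumptions, not taken from the source.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO-style loss for a single (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response
    given its prompt, under either the policy or the reference model.
    """
    # How much more (in log space) the policy likes each response
    # than the frozen reference model does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp

    # The loss falls as the policy favors the chosen response
    # over the rejected one, relative to the reference.
    margin = beta * (chosen_gain - rejected_gain)
    return -log_sigmoid(margin)
```

When the policy matches the reference model the margin is zero and the loss equals log 2. It drops below that as the policy shifts probability toward the preferred response and rises above it when the policy shifts toward the rejected one. In practice the log-probabilities would come from a forward pass of the LLM and the loss would be averaged over a batch of pairs.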
