Yaqi Duan | Arxiv_2502

New paper "PILAF: Optimal human preference sampling for reward modeling" posted on arXiv!
We introduce PILAF, a simple yet effective algorithm for collecting human preference data in RLHF, and demonstrate its efficiency both theoretically and empirically.