Reward Model Interpretability via Optimal and Pessimal Tokens

Christian B., Kirk HR., Thompson JAF., Summerfield C., Dumbalska T.

DOI

10.1145/3715275.3732068

Type

Conference paper

Publication Date

2025-06-23
