Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Rocher L., Bean A., Othniel Kearns R., Hafner F., Mayne H., Korgul K., Batra H., Deb O., Emde C., Foster T., Ibrahim L., Kim H., Kirk H., Lin F., Magomere J., Rystrom J., Yang Y., Bibi A., Clark R., Foerster J., Gal Y., Hale S., Summerfield C., Torr P., Mahdi A.

Type

Conference paper

Publication Date

2025-12-05
