The most popular way we evaluate large language models measures the wrong thing: it rewards likeability over accuracy and value.