A recent paper on arXiv examines the limits of evaluating language models on single outputs alone. It points out that each output is just one sample from a much larger distribution of possible completions.
The authors argue that judging a model one output at a time can obscure what it is actually capable of and make it harder for users to interact with it effectively. By visualizing and comparing distributions of outputs, users may gain deeper insight into how a model behaves.
Understanding these distributions could lead to better evaluation methods and, in turn, improve how users engage with AI technologies.
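
To make the idea of comparing output distributions concrete, here is a minimal Python sketch. It is not the paper's method: the helper names `output_distribution` and `total_variation` are hypothetical, the sample completions are made-up data, and total variation distance is just one simple way to compare two sets of sampled outputs.

```python
from collections import Counter


def output_distribution(completions):
    """Return the empirical probability of each distinct completion."""
    counts = Counter(text.strip() for text in completions)
    total = sum(counts.values())
    return {text: n / total for text, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two empirical distributions.

    0.0 means the distributions are identical, 1.0 means they share no outputs.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


# Hypothetical data: eight sampled completions for each of two prompts.
samples_a = ["Paris", "Paris", "Paris", "Paris", "Paris", "Paris", "Lyon", "Paris"]
samples_b = ["Paris", "Lyon", "Paris", "Marseille", "Lyon", "Paris", "Lyon", "Paris"]

dist_a = output_distribution(samples_a)
dist_b = output_distribution(samples_b)
distance = total_variation(dist_a, dist_b)

print(dist_a)
print(dist_b)
print(f"total variation distance: {distance:.3f}")
```

A single draw from either prompt could easily be "Paris", so the two would look the same if judged on one output. The difference only becomes visible once many samples are collected and their distributions are compared.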