The Standardised Test Suite: A Method for Evaluating Agents in Human-Computer Interactions
Training artificial agents to interact seamlessly with humans is a challenging task. How do we measure progress in such complex interactions? One answer is the Standardised Test Suite (STS), a method for evaluating agents in temporally extended, multi-modal interactions.
The STS methodology places agents in behavioral scenarios drawn from real human interaction data. The context of each scenario is replayed to the agent, which then receives an instruction and must complete the interaction offline. Human raters annotate each agent continuation as successful or unsuccessful, and agents are ranked by their success rate across scenarios.
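In pseudocode, this evaluation reduces to a simple loop: replay a scenario, ask the agent for a continuation, collect a human judgement, and aggregate success rates. The sketch below is a minimal illustration of that loop; the names (Scenario, evaluate_agent, the rate callback) are hypothetical placeholders and do not correspond to any published implementation.

```python
# Minimal sketch of the STS evaluation loop. All names here are hypothetical
# placeholders, not the actual STS implementation.
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Scenario:
    scenario_id: str
    context: Sequence    # pre-recorded human interaction replayed to the agent
    instruction: str     # the instruction the agent is asked to carry out


def evaluate_agent(
    agent: Callable[[Sequence, str], Sequence],
    scenarios: List[Scenario],
    rate: Callable[[Scenario, Sequence], bool],
) -> float:
    """Return the agent's success rate across scenarios.

    `agent` maps (context, instruction) to a continuation of the interaction,
    produced offline. `rate` stands in for a human rater who marks each
    continuation as successful or unsuccessful.
    """
    successes = sum(
        rate(scenario, agent(scenario.context, scenario.instruction))
        for scenario in scenarios
    )
    return successes / len(scenarios)
```

Agents can then be ranked directly by the success rate this loop returns, exactly as described above.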
Challenges in Human-Agent Interaction
Human interactions involve nuances and subtleties that are difficult to formalize. Traditional reinforcement learning methods that succeed in games like Atari or Go are not sufficient for teaching agents to engage in fluid, successful interactions with humans, because those games provide a clear, automatically checkable reward signal. The difference between answering “Who won this game of Go?”, which has a single verifiable answer, and “What are you looking at?”, which depends on context and has no formal success criterion, illustrates the complexity of human communication that cannot easily be coded into a reward function.
While interactive evaluation by humans provides valuable insight into agent performance, it is cumbersome and time-consuming. Previous evaluation methods, such as scripted probe tasks, correlate only weakly with results from live interactive evaluation. The STS methodology offers a more controlled and efficient way to assess agent performance in human-agent interactions.
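To make that correlation claim concrete, one way to check whether a proxy metric tracks interactive evaluation is to compare agent rankings under both, for example with a Spearman rank correlation. The snippet below is purely illustrative, with invented scores rather than real results.

```python
# Illustrative check of how well an offline proxy metric (e.g. scripted probe
# tasks) tracks interactive evaluation: rank-correlate the two scores across
# agents. The numbers are invented for illustration only.
from scipy.stats import spearmanr

probe_task_scores = [0.42, 0.55, 0.61, 0.48, 0.70]          # one score per agent
interactive_success_rates = [0.30, 0.52, 0.49, 0.41, 0.66]  # same agents, live evaluation

rho, p_value = spearmanr(probe_task_scores, interactive_success_rates)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")
```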
Automating Evaluation with STS
Just as benchmark datasets like MNIST and ImageNet accelerated machine learning by standardising evaluation, the STS methodology aims to streamline the evaluation of human-agent interactions. Human annotation of agent continuations is still required, but there is potential to automate this step in the future. That would enable rapid and effective evaluation of interactive agents, speeding up progress in this research area.
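What might that automation look like? One speculative possibility is to train a model on continuations that human raters have already annotated, then use it to predict success labels for new continuations. The sketch below assumes toy feature vectors and a scikit-learn classifier purely for illustration; neither the feature extraction nor the model choice is part of the STS as published.

```python
# Speculative sketch: learn to imitate human raters from already-annotated
# continuations, then score new ones automatically. Features and model choice
# are placeholders, not part of the STS methodology itself.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature vectors for continuations that human raters have already labelled
# (in a real system these might be pooled language/vision embeddings).
labelled_features = np.array([[0.1, 0.9], [0.8, 0.2], [0.2, 0.7], [0.9, 0.1]])
human_labels = np.array([1, 0, 1, 0])  # 1 = rated successful, 0 = unsuccessful

success_model = LogisticRegression().fit(labelled_features, human_labels)

# Estimated probability that a new, unlabelled continuation would be rated
# successful by a human.
new_continuation_features = np.array([[0.15, 0.80]])
print(success_model.predict_proba(new_continuation_features)[0, 1])
```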