
Development of an Artificial Intelligence Test Harness for the Department of Defense

PUBLIC RELEASE: January 2025
COMPLETED: September 2024

AUTHORS: Dr. Laura Freeman, Dr. Stephen Adams, Dr. Erin Lanus, Dr. Naren Ramakrishnan, Mr. Brian Mayer, Dr. Patrick Butler, Dr. Jaganmohan Chandrasekaran, Dr. William Headley, Mr. Brian Lee, Mr. Alex Kyer
VIRGINIA TECH

One of the core strategic pillars of the Office of the Director, Operational Test and Evaluation (DOT&E) is to pioneer test and evaluation (T&E) methods for weapon systems designed to change over time. Machine learning (ML) and artificial intelligence (AI) models are notably capable of learning and changing over time. Moreover, the stochastic nature of models that learn from past data presents new challenges for T&E. It is essential to ensure these systems operate effectively, safely, and securely. A reliable test harness that provides high-quality data, AI models, and T&E capabilities will accelerate and inform the development of new methods.

DOT&E is responsible for developing policies for the T&E of AI-enabled systems. However, both AI capabilities and the corresponding T&E methods for AI/ML are still evolving. Developing test harnesses has the potential not only to accelerate method development but also to inform DOT&E’s policy and guidance. Finally, a test harness can serve as an educational resource for the T&E community, allowing testers to learn T&E for AI-enabled systems by working with the tools, processes, and methods the harness provides.

In this research, an Acquisition Innovation Research Center (AIRC) team from Virginia Tech designed a framework for an AI Test Harness that can be applied to multiple types of AI models. Along with the framework, the research team developed a set of requirements for an AI Test Harness and produced a simple prototype. The team then applied the framework to two use cases. The first, a Radio Frequency Machine Learning (RFML) use case, uses standard classification models; the second focuses on Large Language Models (LLMs), a form of generative AI.
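The report summarized here does not specify the prototype's implementation. To illustrate the model-agnostic design such a framework implies, the following Python sketch shows one possible shape of a harness that drives both a signal classifier and an LLM through a common interface. All names in it (Harness, ModelAdapter, TestCase, and the adapter classes) are illustrative assumptions, not the prototype's actual API.

```python
# A minimal sketch (not the report's actual design) of a test harness
# that evaluates heterogeneous AI models behind one common interface.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TestCase:
    inputs: object    # e.g., an RF waveform or an LLM prompt
    expected: object  # e.g., a modulation label or a reference answer


class ModelAdapter(ABC):
    """Wraps a model so the harness can drive it uniformly."""

    @abstractmethod
    def predict(self, inputs: object) -> object: ...


class Harness:
    """Runs a suite of test cases against any adapted model and scores it."""

    def __init__(self, metric: Callable[[object, object], float]):
        self.metric = metric

    def run(self, model: ModelAdapter, suite: Sequence[TestCase]) -> float:
        scores = [self.metric(model.predict(tc.inputs), tc.expected)
                  for tc in suite]
        return sum(scores) / len(scores)  # mean score over the suite


# Hypothetical adapters for the two use cases described above.
class RFClassifierAdapter(ModelAdapter):
    def __init__(self, classifier: Callable[[object], str]):
        self.classifier = classifier

    def predict(self, inputs: object) -> str:
        return self.classifier(inputs)  # returns a class label


class LLMAdapter(ModelAdapter):
    def __init__(self, generate: Callable[[str], str]):
        self.generate = generate

    def predict(self, inputs: object) -> str:
        return self.generate(inputs)  # returns generated text


# Exact-match accuracy suits classification; generative output
# generally needs softer, task-specific metrics instead.
def exact_match(predicted: object, expected: object) -> float:
    return float(predicted == expected)


if __name__ == "__main__":
    suite = [TestCase(inputs="qpsk_waveform_01", expected="QPSK")]
    model = RFClassifierAdapter(lambda _: "QPSK")  # stand-in classifier
    print(Harness(exact_match).run(model, suite))  # 1.0
```

The key design point this sketch tries to capture is the adapter layer: the harness itself is indifferent to model type, while the choice of metric changes with the use case, since exact-match accuracy works for RFML classification but generative LLM output typically requires similarity- or rubric-based scoring.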

The framework and prototype developed through this project serve as an important foundation for future work in AI T&E. The project also underscores the need for continued investment in research, tool development, and educational resources. By establishing a reliable test harness and a robust set of policies, standards, and metrics, DOT&E can better equip the T&E community to handle the unique challenges posed by AI and ML systems, ultimately ensuring that these technologies can be deployed safely and effectively in defense applications.