Synthetic data is supposed to solve DoD's classified training problem. Here's why it's harder than anyone wants to admit.

The Department of Defense has a fundamental problem with modern AI development: most of the interesting data is classified, and you can't train models on classified information using commercial cloud platforms. This isn't a policy preference—it's a hard constraint that makes the entire Silicon Valley AI playbook unusable for defense applications.
Synthetic data generation is becoming the DoD's answer. But as I've learned working on Navy ERP systems and audit readiness initiatives, the gap between "we generated synthetic data" and "this data is operationally useful" is massive.
Here's the constraint: DoD systems that handle classified data up to Secret fall under Impact Level 6 (IL6) and require air-gapped or highly controlled environments; even IL5, which covers the most sensitive unclassified national security data, rules out most commercial tooling. Commercial AI platforms operate in IL2-IL4 environments at best. Azure Government and AWS GovCloud offer IL5-authorized regions, but they're expensive, feature-limited, and still don't give you the full suite of modern AI development tools.
The practical result is a brutal paradox: the data that would make defense AI actually useful is the data you legally cannot use for training.
Synthetic data generation promises to solve this by creating artificial datasets that preserve the statistical properties and operational characteristics of real classified data without containing actual classified information. In theory, you generate synthetic data that "looks like" your classified operational data, declassify the synthetic set, and train models on commercial infrastructure.
The technical approaches fall into three categories:
Generative adversarial networks (GANs) use two neural networks in competition: a generator creates synthetic samples, and a discriminator tries to distinguish synthetic from real data. The generator improves until the discriminator can't tell the difference.
Why DoD cares: GANs can generate realistic imagery (satellite reconnaissance), sensor data (radar signatures), and structured data (logistics records) that preserve complex correlations.
The problem: GANs are notoriously unstable to train and prone to mode collapse—they end up generating variations of a few examples rather than the full distribution. For defense applications where edge cases matter (rare threat scenarios, unusual operational conditions), mode collapse is catastrophic.
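To make the generator/discriminator competition concrete, here is a minimal training-loop sketch in PyTorch. The tabular framing, layer sizes, and hyperparameters are illustrative assumptions, not a reference design for any DoD system.

```python
# Minimal GAN training step: a generator maps noise to synthetic records while a
# discriminator learns to separate them from real ones. All sizes are illustrative.
import torch
import torch.nn as nn

FEATURES, LATENT = 16, 8  # e.g., a 16-column structured record, 8-dim noise vector

generator = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, FEATURES))
discriminator = nn.Sequential(nn.Linear(FEATURES, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch: torch.Tensor) -> None:
    n = real_batch.size(0)
    real_labels, fake_labels = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator update: learn to tell real records from generated ones.
    fake_batch = generator(torch.randn(n, LATENT)).detach()
    loss_d = bce(discriminator(real_batch), real_labels) + bce(discriminator(fake_batch), fake_labels)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator update: produce samples the discriminator scores as "real".
    fake_batch = generator(torch.randn(n, LATENT))
    loss_g = bce(discriminator(fake_batch), real_labels)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```

Mode collapse shows up when the generator learns to satisfy the discriminator with a narrow slice of the distribution, which is exactly the failure mode that erases rare threat scenarios.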
Diffusion models work by gradually adding noise to real data, then learning to reverse the process. This approach has proven more stable than GANs and currently powers most state-of-the-art generative AI (Stable Diffusion, DALL-E).
Why DoD cares: Diffusion models can generate high-fidelity synthetic data across multiple modalities—imagery, text, time-series sensor data. They're also more controllable than GANs, allowing conditional generation based on specific requirements.
The problem: Training diffusion models requires enormous compute and large datasets. If you only have a small classified dataset (common in defense), you don't have enough samples to train a good diffusion model. You end up needing synthetic data to train the model that generates synthetic data—a chicken-and-egg problem.
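The core training objective is simple to state: noise a real sample at a random timestep, then train a network to predict that noise. Below is a compact sketch with an illustrative tiny MLP and a 1,000-step linear schedule; real systems need far larger models and datasets, which is precisely the small-dataset problem above.

```python
# Diffusion training objective sketch: learn to predict the noise added to x0 at a
# random timestep t, so the reverse (denoising) process can be run at sample time.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retained at each step

# Illustrative denoiser: input is the noisy 16-feature sample plus a timestep channel.
denoiser = nn.Sequential(nn.Linear(17, 128), nn.ReLU(), nn.Linear(128, 16))

def diffusion_loss(x0: torch.Tensor) -> torch.Tensor:
    n = x0.size(0)
    t = torch.randint(0, T, (n,))
    eps = torch.randn_like(x0)
    a = alpha_bars[t].unsqueeze(1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps               # forward process: noised sample
    eps_hat = denoiser(torch.cat([x_t, t.unsqueeze(1) / T], dim=1))
    return nn.functional.mse_loss(eps_hat, eps)              # learn to reverse the noising
```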
Instead of learning from data, simulation-based approaches use physics models, game engines, and procedural generation to create synthetic scenarios. Think flight simulators, wargaming engines, and digital twins.
Why DoD cares: Simulations can generate unlimited training data for scenarios that rarely occur in real operations—contested environments, multi-domain operations, novel threat vectors. The data is inherently unclassified because it's generated from first principles.
The problem: Simulation fidelity is hard. A synthetic radar signature generated from a physics model might be technically accurate but miss the weird environmental artifacts and sensor quirks that real operational data contains. Models trained on "clean" simulated data often fail when faced with messy reality.
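As a toy illustration of generating sensor data from first principles, here is a sketch that draws synthetic radar return power from the standard radar range equation and adds a naive noise term. Every parameter value is an illustrative assumption; the noise and clutter model is exactly where clean simulation diverges from messy operational data.

```python
# First-principles synthetic sensor data: received power from the radar range
# equation (falls off with range^4), plus a crude noise term. Real returns also
# carry clutter, multipath, and hardware quirks that this model does not capture.
import numpy as np

rng = np.random.default_rng(0)

def synthetic_returns(n, pt=1e3, gain_db=30.0, wavelength=0.03, rcs=1.0):
    g = 10 ** (gain_db / 10)                              # antenna gain, dB to linear
    ranges = rng.uniform(5e3, 50e3, n)                    # target range in meters
    pr = (pt * g**2 * wavelength**2 * rcs) / ((4 * np.pi) ** 3 * ranges**4)
    return ranges, pr * (1 + rng.normal(0, 0.1, n))       # naive multiplicative noise

ranges, power = synthetic_returns(10_000)
```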
Here's where theory meets the wall of operational reality. How do you validate that synthetic data is operationally useful without comparing it to the classified data you're trying to protect?
Synthetic data needs to preserve the statistical properties of the real data, the correlations and operational patterns that make it mission-relevant, and the rare edge cases where models tend to fail.
Current validation approaches are inadequate:
Distance metrics (Kullback-Leibler divergence, Wasserstein distance) can measure statistical similarity but don't capture whether the synthetic data contains the operationally critical patterns. You can have statistically similar data that's useless for the actual mission.
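As a concrete example of what a distance metric does and does not tell you, here is a minimal per-column comparison using SciPy's 1-D Wasserstein distance. The column names and distributions are illustrative; low distances only indicate marginal similarity and say nothing about whether mission-critical joint patterns survived.

```python
# Per-feature Wasserstein distance between real and synthetic columns. This checks
# marginal distributions only; joint structure and rare events need separate tests.
import numpy as np
from scipy.stats import wasserstein_distance

def marginal_report(real, synthetic, names):
    """Wasserstein distance per column; assumes the same column order in both arrays."""
    return {name: wasserstein_distance(real[:, i], synthetic[:, i])
            for i, name in enumerate(names)}

rng = np.random.default_rng(1)
real = rng.normal([100, 14, 2.5], [20, 3, 0.5], size=(5000, 3))   # e.g., demand, lead time, cost
synth = rng.normal([98, 15, 2.4], [25, 2, 0.6], size=(5000, 3))
print(marginal_report(real, synth, ["demand", "lead_time", "cost"]))
```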
Holdout testing requires access to classified data to validate, which defeats the purpose of using synthetic data in the first place. You end up needing a cleared team working inside the classified enclave to validate the synthetic data, then a separate uncleared team to do the actual model development. This roughly doubles your development overhead.
Adversarial testing is crucial but underutilized. Red teams should actively try to find differences between models trained on synthetic vs. real data. If adversaries can distinguish between them, your synthetic data isn't good enough.
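One cheap, repeatable form of this is a classifier two-sample test: train a model to tell real records from synthetic ones and look at its cross-validated ROC AUC. A sketch with illustrative data follows; AUC near 0.5 means the two sets are statistically hard to separate, while AUC near 1.0 means the synthetic data is trivially distinguishable. This is a screening check, not a substitute for a human red team.

```python
# Classifier two-sample test: if a model can reliably separate real from synthetic
# records, the synthetic data is missing something. Data here is illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def distinguishability_auc(real, synthetic):
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])
    return cross_val_score(GradientBoostingClassifier(), X, y, cv=5, scoring="roc_auc").mean()

rng = np.random.default_rng(2)
real = rng.normal(0.0, 1.0, size=(2000, 8))
synth = rng.normal(0.1, 1.1, size=(2000, 8))               # deliberately slightly off
print(f"AUC: {distinguishability_auc(real, synth):.2f}")   # ~0.5 is the goal
```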
Even synthetic data isn't automatically unclassified. The process of generating synthetic data from classified sources can itself reveal classified information.
Consider: if you train a GAN on classified satellite imagery and the generator produces realistic synthetic images, those synthetic images might reveal classified capabilities—resolution, spectral bands, revisit rates. The synthetic data inherits classification from the generation process.
The current approach requires a classification review of synthetic datasets before they can be released for unclassified development. This review is manual, expensive, and creates a bottleneck. I've seen classification reviews take 6+ months for datasets that were supposed to accelerate development.
Some organizations are exploring automated classification tools that scan synthetic data for potential classification spillage, but these tools are conservative by necessity. They tend to over-classify, defeating the purpose of synthetic data generation.
Synthetic threat data for training intrusion detection and malware classification models shows promise. You can generate synthetic network traffic, malware variants, and attack patterns without exposing actual threat intelligence.
The catch: Adversaries evolve. Synthetic data based on historical threats doesn't capture novel attack vectors. You end up fighting the last war.
Synthetic operational scenarios for training course-of-action recommendation systems can work, but fidelity is the constant challenge. Simulated operations lack the friction, uncertainty, and human factors of real plans.
The catch: Models trained on synthetic planning data tend to be overconfident and brittle. They optimize for the simulation, not reality.
Supply chain and logistics data is well-suited for synthetic generation. Logistics follows relatively consistent patterns (demand forecasting, routing, inventory), and simulation-based approaches can generate realistic scenarios.
The catch: Real logistics data contains supplier relationships, lead times, and cost structures that are themselves sensitive. Synthetic data that preserves these relationships without revealing actual vendors is hard.
Generating synthetic communications data that preserves linguistic patterns, social networks, and operational security practices is extremely difficult. Language models trained on synthetic comms tend to produce stilted, unrealistic text.
The catch: Real operational communications have context, abbreviations, and domain-specific jargon that's hard to synthesize. Models trained on synthetic comms fail when deployed on real intercepts.
The DoD's focus on synthetic data is understandable, but it's not the only approach to the classified training problem.
Differential privacy adds calibrated noise to datasets to provide mathematical privacy guarantees. Instead of generating fully synthetic data, you can train models on differentially private versions of classified data.
The advantage: You work with real data patterns, not synthetic approximations. Models trained with differential privacy maintain stronger performance characteristics.
The disadvantage: Differential privacy assumes you can quantify the privacy loss you're willing to accept. For classified data, there's no meaningful way to say "we'll accept ε=1.0 privacy loss"—classification is binary, not probabilistic.
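For readers who haven't seen what "calibrated noise" and epsilon actually look like, here is a minimal Laplace-mechanism sketch for a single aggregate query. The query, bounds, and epsilon value are illustrative assumptions; the disadvantage above is precisely that no epsilon maps cleanly onto "still classified" versus "releasable."

```python
# Laplace mechanism: noise scaled to sensitivity / epsilon. Smaller epsilon means
# stronger privacy and noisier answers. Values here are purely illustrative.
import numpy as np

rng = np.random.default_rng(3)

def private_mean(values, lower, upper, epsilon):
    """Differentially private mean of values known to lie in [lower, upper]."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)   # max effect of changing one record
    return clipped.mean() + rng.laplace(0.0, sensitivity / epsilon)

shipments = rng.normal(120, 30, size=10_000)       # illustrative logistics metric
print(private_mean(shipments, 0, 300, epsilon=1.0))
```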
Federated learning trains models across distributed datasets without centralizing the data. You could train a model across multiple classified enclaves without exposing the underlying data.
The advantage: Models learn from real classified data while maintaining data isolation. This preserves the operational fidelity that synthetic data struggles with.
The disadvantage: Federated learning requires infrastructure across multiple security domains, assumes you have multiple relevant datasets (often you don't), and introduces synchronization and communication overhead.
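The core mechanics are straightforward; the cost is everything around them. A minimal federated averaging (FedAvg) round looks like the sketch below: each enclave trains locally and only model weights cross the boundary, never records. The weight layout is an illustrative assumption, and a real deployment still needs cross-domain transfer approvals and secure aggregation.

```python
# Federated averaging: combine per-enclave model weights, weighted by local dataset
# size. No raw records leave any enclave; only the weight vectors are shared.
import numpy as np

def fed_avg(client_weights, client_sizes):
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Example: three enclaves with different dataset sizes send local weight vectors.
enclave_weights = [np.array([0.9, 1.2]), np.array([1.1, 0.8]), np.array([1.0, 1.0])]
enclave_sizes = [5000, 1200, 300]
print(fed_avg(enclave_weights, enclave_sizes))
```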
Technologies like Intel SGX and AMD SEV provide hardware-based trusted execution environments. You could theoretically train models on classified data inside secure enclaves on commercial cloud infrastructure.
The advantage: You get access to commercial AI platforms and tools while maintaining data isolation.
The disadvantage: Secure enclaves have performance limitations, limited memory, and complex certification requirements, and accrediting a confidential computing environment for classified workloads is still theoretical, not practical.
Based on my work with Navy financial systems and audit readiness, the practical answer isn't synthetic data alone. It's hybrid approaches that combine multiple techniques: simulation-based generation where physics models are good enough, differential privacy or federated learning where real data patterns matter most, and cleared validation against classified holdouts where operational fidelity is mission-critical.
The DoD needs to stop looking for a single silver bullet and start building integrated data pipelines that use the right technique for each use case.
If you're building synthetic data capabilities for defense applications, here's what I've learned:
Don't try to generate perfect synthetic data. Generate large volumes of "good enough" synthetic data and iterate based on downstream model performance. Perfect is the enemy of done.
Every synthetic dataset should come with a validation report: statistical properties, operational coverage, downstream task performance. Make validation automated and continuous, not a one-time check.
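Here is a sketch of what that report could look like as a machine-readable artifact, combining the marginal-distance and distinguishability checks sketched earlier with a downstream-task delta supplied by the model team. Field names and pass thresholds are illustrative assumptions, not an established standard.

```python
# Automated validation report stub: emit a JSON artifact alongside every synthetic
# dataset release so the checks are continuous rather than one-time.
import json

def validation_report(marginal_distances, discriminator_auc, downstream_delta):
    report = {
        "marginal_wasserstein": marginal_distances,     # per-feature statistical similarity
        "discriminator_auc": discriminator_auc,         # ~0.5 means indistinguishable
        "downstream_task_delta": downstream_delta,      # synthetic-trained minus real-trained score
        "pass": discriminator_auc < 0.6 and downstream_delta > -0.05,
    }
    return json.dumps(report, indent=2)

print(validation_report({"demand": 1.8, "lead_time": 0.4}, 0.57, -0.03))
```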
The team generating synthetic data should not be the same team doing classification review. You need independent validation that synthetic data doesn't leak classified information.
Your first synthetic dataset will be inadequate. Plan for multiple generations as you learn what operational characteristics matter and which can be approximated.
When synthetic data fails, you need to know why. Was it statistical divergence? Missing edge cases? Wrong operational assumptions? Without detailed documentation, you'll repeat the same mistakes.
Synthetic data generation is not a solved problem for defense AI applications. The DoD is mandating it because there's no good alternative for training on classified data, not because the technology is mature.
The vendors selling synthetic data solutions are overselling capabilities and underselling validation challenges. The fidelity gap between synthetic and real operational data is larger than anyone wants to admit.
If you're building defense AI systems, assume synthetic data will get you 70-80% of the way there. The remaining 20-30% will require access to real classified data, cleared development teams, and infrastructure accredited for classified workloads. Budget and plan accordingly.
The synthetic data mandate is the DoD acknowledging that modern AI development and classification requirements are fundamentally incompatible. We're building workarounds, not solutions. The sooner we're honest about that, the better systems we'll build.
Amyn Porbanderwala is Director of Innovation at Navaide, where he works on Navy ERP systems and financial audit readiness. He previously served as a Cyber Network Operator in the Marine Corps Reserve and holds a CISA certification. Views expressed are his own.