
AI model training requires vast amounts of data, which are often not readily available. As a result, generating data sets artificially has become both a necessity and an industry in itself. There are pros and cons to using synthetic data, but let’s start with a definition:
“A synthetic data set is any data set that represents a real-world use case but has been generated, using one of several methodologies, rather than collected.”
Synthetic data generation involves a range of techniques and tools that enable the creation of artificial datasets. These techniques vary depending on the specific requirements and characteristics of the target dataset. Some commonly used techniques include generative adversarial networks (GANs), variational autoencoders (VAEs), and rule-based approaches. GANs, for example, train a generator model against a discriminator that tries to tell real samples from generated ones, so the generator learns to produce synthetic data that closely resembles the real data. VAEs learn a probabilistic representation of the data and sample from it to generate new data samples.
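Of the three techniques, the rule-based approach is the simplest to illustrate. Below is a minimal sketch, assuming a fictitious customer table; the field names, plans, and value ranges are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical rules for a fictitious customer table. The field names and
# value ranges are illustrative assumptions, not taken from a real schema.
PLANS = ["free", "basic", "premium"]


def generate_customer(rng: random.Random, customer_id: int) -> dict:
    """Build one synthetic customer record from hand-written rules."""
    plan = rng.choice(PLANS)
    # Rule: the free plan never spends; paid plans spend within a fixed range.
    if plan == "free":
        monthly_spend = 0.0
    elif plan == "basic":
        monthly_spend = round(rng.uniform(5, 20), 2)
    else:
        monthly_spend = round(rng.uniform(20, 100), 2)
    return {
        "customer_id": customer_id,
        "age": rng.randint(18, 90),
        "plan": plan,
        "monthly_spend": monthly_spend,
    }


def generate_customers(n: int, seed: int = 0) -> list[dict]:
    """Generate n records; a fixed seed makes the data set reproducible."""
    rng = random.Random(seed)
    return [generate_customer(rng, i) for i in range(n)]


customers = generate_customers(1000)
```

Every relationship in the output (here, plan versus spend) exists only because a rule put it there, which is both the strength and the limit of this approach.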
Benefits of using synthetic data for model training
Specificity – Certain models require very specific data sets. By generating them first and seeing how your models react, you can derive the requirements for a more precise data collection strategy.
Balanced data – There are times when you need to train a model with an evenly distributed data set. For example, you may want a training set that is 50% female and 50% male, or one with equal shares of each age group.
PII – Personally identifiable information (PII) can be a security problem when it comes to training data sets. You have to be sensitive to the security and privacy of your customers. Synthetic data sets contain fictitious PII, which poses no security or privacy risk if it gets exposed.
Quantity – You can generate as much data as you need.
Quality – Since the data is generated programmatically, a correctly implemented algorithm should not produce records with quality problems such as missing or malformed values.
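Several of these benefits can be shown together in a few lines. The sketch below is one possible way to produce a balanced data set of any size with fictitious identifiers; the function name `generate_balanced` and the record fields are assumptions made for this example.

```python
import random
from collections import Counter


def generate_balanced(n_per_group: int, groups: list[str], seed: int = 0) -> list[dict]:
    """Generate the same number of records for every group (hypothetical schema)."""
    rng = random.Random(seed)
    records = []
    for group in groups:
        for _ in range(n_per_group):
            records.append({
                # Fictitious identifier: it maps to no real person.
                "name": f"user_{len(records):06d}",
                "gender": group,
                "age": rng.randint(18, 90),
            })
    # Shuffle so that the groups are not stored in contiguous blocks.
    rng.shuffle(records)
    return records


# Quantity is just a parameter: raise n_per_group to get more data.
records = generate_balanced(500, ["female", "male"])
counts = Counter(r["gender"] for r in records)
print(counts)
```

The balance is exact by construction, which is something a collected data set can rarely guarantee.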
The problem with synthetic data
Semi-blind validation – Take, for example, a direct-to-consumer predictive churn model. You need to confirm that your results are accurate; that is, that the consumers predicted to churn will indeed churn. This cannot be done with “fake” data. You need real data to confirm your results in a validation test.
Network completeness – Many AI models use direct or indirect relationships between different data points, e.g., people on an eCommerce site, in a video game, or on a social media site. Synthetic data cannot reproduce the connections between different data points.
Behavioral significance – Some AI models rely heavily on behaviors in order to produce viable results, e.g., navigating a map in a video game, writing product reviews, driving, etc. As with “Network completeness,” there are behaviors that cannot be synthetically produced.
Bias – Synthetic data generation can inadvertently introduce bias or inaccuracies if there is a flaw in the data generation algorithm.
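The bias problem is at least partly checkable. As a rough sketch, assuming you know the proportions you intended for a categorical attribute, you can measure how far the generated data drifts from them; the helper `max_proportion_gap` and the sample values are hypothetical.

```python
from collections import Counter


def max_proportion_gap(values: list[str], target: dict[str, float]) -> float:
    """Return the largest absolute gap between observed and intended proportions."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = len(values)
    return max(abs(counts.get(key, 0) / total - share) for key, share in target.items())


# Intended distribution versus the output of a (deliberately) flawed generator.
target = {"female": 0.5, "male": 0.5}
skewed = ["female"] * 700 + ["male"] * 300

gap = max_proportion_gap(skewed, target)
if gap > 0.05:  # the tolerance is an arbitrary choice for this example
    print(f"Generator looks biased: gap of {gap:.2f}")
```

A check like this only catches bias in the attributes you think to measure; it says nothing about the behaviors and relationships described above.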
Conclusion
Should you use synthetic data? The answer is “it depends on the use case.” Synthetic data helps most while you develop the model and test its scalability. There are many benefits to using synthetic data in development, from security and privacy, to data availability, to getting the “right” data set.
Should you use AI models trained with synthetic data in production? The answer is “it depends on the model use case,” but generally speaking “NO,” because synthetic data lacks the nuances needed to train production models and produce workable results.
As I wrote above, synthetic data sets are particularly useful in some use cases, but in many cases they lack the elements production models need to provide verifiable, actionable results that you can take to the bank.

