Accelerate on-cloud testing with synthetic data
By Abhay Goel and Parneeti Sood
Synthetic data, or artificially generated data, is widely used to conduct data science experiments, build test datasets, and train machine learning models. It emulates production data in terms of variety and veracity, and helps users test their applications, pipelines, and analytics engines on the cloud without worrying about security risks or data transfer costs.
However, with various factors at play, technology leaders often face a conundrum: should testing use production data or synthetic data? This blog decodes the complexities of synthetic data and delves into the key advantages of using it to eliminate on-cloud testing bottlenecks.
Lower security risk
With cybercrime on an unprecedented rise, data security is of paramount importance for businesses worldwide. Data engineering teams need to take stringent security measures while sharing data on a public cloud for testing purposes. Obscuring personally identifiable information (PII) is not enough, as production data still needs to be handled during the masking process and archived on systems that can potentially be compromised.
Picture this: when an entertainment giant released data displaying movie ratings, it removed usernames and randomized ID numbers. Researchers were nevertheless able to de-anonymize some of this data by comparing rankings and timestamps with public information on other websites. Moreover, masking adds noise to the data, while randomization often dilutes essential patterns needed for problem-solving. Synthetic data addresses all these challenges and enables risk-free testing.
Reduced transfer cost
Production data needs to be prepared, managed, and transferred to cloud platforms before it can be used for testing. This cost can easily run into hundreds of thousands of dollars, especially with terabytes or petabytes of data. Often, additional costs are incurred while transferring data to multiple cloud partners to complete proofs of concept (POCs). Synthetic data is a much more cost-effective alternative, as it does not need to be transferred. It is also easier to manage and archive.
Faster time-to-market
While transferring terabytes of production data from on-premises storage to the cloud can take several hours, synthetic test data can be generated on the fly at a rate of thousands of rows per second. This helps developers save time and accelerate the testing process significantly.
Ease of operations
Data provisioning can and should be a simple, decentralized, self-service process that makes quality test data available in the shortest possible time. Synthetic data generation models and source code can easily be transferred via a secure CI/CD pipeline to generate terabytes of data, so users do not need to move data across multiple cloud platforms to complete POCs on a public cloud.
High-quality data
Proper testing usually requires different permutations of data, including negative test data and edge-case data. Data scientists are often forced to manually modify production data into usable values for their tests, and some test data is simply too complex or time-consuming to build by hand. Synthetic test data, by contrast, is generated from test data scenarios that specify the data patterns and permutations required to cover every edge case, as illustrated in the sketch below. This helps ensure quality data and makes the testing process more effective.
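For illustration, here is a minimal Python sketch of scenario-driven generation. The scenario structure, field names, and edge-case values are hypothetical, chosen only to show how edge cases can be enumerated declaratively instead of hand-editing production records:

```python
import itertools

# Hypothetical scenario spec: each field lists normal values plus
# the negative and edge-case values that hand-edited production
# data would rarely cover.
SCENARIOS = {
    "age": [0, 17, 18, 65, 120, -1],        # boundaries and a negative case
    "balance": [0.00, 0.01, 9_999_999.99],  # zero, minimum, near-overflow
    "country": ["US", "IN", "", None],      # valid, empty, and missing
}

def generate_rows(scenarios, limit=None):
    """Yield every permutation of the scenario values as a test row."""
    keys = list(scenarios)
    combos = itertools.product(*(scenarios[k] for k in keys))
    for i, combo in enumerate(combos):
        if limit is not None and i >= limit:
            break
        yield dict(zip(keys, combo))

for row in generate_rows(SCENARIOS, limit=5):
    print(row)
```

Because each scenario value is listed explicitly, adding a new edge case to the spec automatically propagates to every generated permutation.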
Best practices and techniques for synthetic data generation
While off-the-shelf tools like Synthetic Data Vault (SDV), Synthea, DATPROF, CA Test Data Manager, and Redgate can be used to generate synthetic datasets, the process can be challenging. To simulate real-world data accurately, many factors need to be carefully considered, including attribute values, data distributions, correlations between attributes, multiple correlated tables, and relational datasets. Based on our experience with customers, here are a few practices and techniques that can help you generate quality synthetic data:
· Data generation according to frequency distribution
By analyzing the original data, you can capture the frequency distribution of each attribute and the covariance between attributes, and then use them as probability weights for generating new records. The distribution describes the values a column takes, while the covariance matrix describes how those values depend on other columns; together, these parameters serve as a generative model for the data. Further, a statistical model such as linear regression can learn the weights of the best-fit line from the original data and use them to create synthetic records, generating test data that preserves the correlations of the original. Similarly, studying the seasonality, trends, and correlation between variables helps generate time-series data. A minimal sketch of this approach is shown below.
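As an illustration, the following Python snippet (using only NumPy, with hypothetical column names and values) samples a categorical column according to its observed frequency weights and draws numeric columns from a multivariate normal distribution fitted to the original means and covariance:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "production" sample: two numeric columns, one categorical.
amounts = np.array([120.0, 95.5, 310.2, 88.0, 150.75, 210.0])
quantities = np.array([1, 2, 5, 1, 3, 4], dtype=float)
regions = np.array(["north", "south", "north", "east", "north", "south"])

# 1. Fit the generative parameters: means and covariance of numeric columns.
numeric = np.column_stack([amounts, quantities])
mean = numeric.mean(axis=0)
cov = np.cov(numeric, rowvar=False)

# 2. Frequency distribution of the categorical column as probability weights.
values, counts = np.unique(regions, return_counts=True)
weights = counts / counts.sum()

# 3. Generate synthetic rows that follow the fitted model.
n_rows = 1000
synthetic_numeric = rng.multivariate_normal(mean, cov, size=n_rows)
synthetic_regions = rng.choice(values, size=n_rows, p=weights)

print(synthetic_numeric[:3])
print(synthetic_regions[:3])
```

Because the synthetic rows are drawn from the fitted covariance, pairwise correlations between the numeric columns are approximately preserved.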
· Agent-based modeling
You can also build a model that explains the observed behavior of the data and reproduces it synthetically. The SDV library, for example, learns the relationships between attributes within a table and generates synthetic data accordingly; it can also follow foreign key relationships between tables to produce consistent relational datasets. In addition, a Bayesian network can be used to capture the correlation structure between attributes, with samples drawn from the trained model to construct datasets. A sketch of the SDV workflow is shown below.
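As a rough sketch of the single-table SDV workflow (based on the SDV 1.x API; class and method names have changed across versions, so verify against the current documentation), fitting a synthesizer on a pandas DataFrame and sampling from it looks like this:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Hypothetical production-like table.
real_data = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [34, 52, 29, 41, 37],
    "monthly_spend": [220.5, 480.0, 150.25, 390.9, 275.0],
})

# Let SDV infer column types from the data.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a model of the observed behavior and sample new rows from it.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

synthetic_data = synthesizer.sample(num_rows=1000)
print(synthetic_data.head())
```

For relational datasets, SDV offers multi-table synthesizers that take foreign key relationships into account, following the same fit-and-sample pattern.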
· Use of GANs for image generation
For certain artificial intelligence (AI) use cases, sizable volumes of quality images are needed to train deep learning models effectively. However, sourcing large, diverse, and accurately annotated sets of images can be an expensive proposition. A popular alternative is to augment existing images using a combination of techniques like rotation, flipping, translation, Gaussian noise addition, color space transformation, cropping, padding, and blurring, but this process involves a lot of time and effort.
Generative Adversarial Networks (GANs) help address these challenges by generating artificial images that closely resemble real-world images (such as photos of human faces that don’t belong to real people). A GAN pits a generator network, which maps random noise vectors to candidate images, against a discriminator (adversary) network, which learns to tell generated images from real ones; adversarial training gradually pushes the generator toward realistic output. This synthetic data helps development teams fast-track model training and improve its accuracy. A minimal training-loop sketch is shown below.
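For illustration, here is a heavily simplified GAN training step written in PyTorch (a framework assumption, since the blog names none); the layer sizes, flattened image shape, and stand-in training batch are hypothetical, and a real image model would use a convolutional architecture such as DCGAN:

```python
import torch
import torch.nn as nn

LATENT_DIM, IMG_DIM = 64, 28 * 28  # hypothetical sizes; flattened 28x28 images

# Generator: maps random noise vectors to fake images.
G = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG_DIM), nn.Tanh(),
)

# Discriminator: outputs a logit scoring how "real" an image looks.
D = nn.Sequential(
    nn.Linear(IMG_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

loss_fn = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_images):
    batch = real_images.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1. Train the discriminator to separate real from generated images.
    noise = torch.randn(batch, LATENT_DIM)
    fake_images = G(noise).detach()  # freeze G while updating D
    d_loss = loss_fn(D(real_images), ones) + loss_fn(D(fake_images), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2. Train the generator to fool the discriminator.
    noise = torch.randn(batch, LATENT_DIM)
    g_loss = loss_fn(D(G(noise)), ones)  # G wants D to output "real"
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Demo with a random stand-in batch of "real" images in [-1, 1].
print(train_step(torch.rand(16, IMG_DIM) * 2 - 1))
```

The alternating updates are the core of the technique: the discriminator gets better at spotting fakes, which in turn forces the generator to produce increasingly realistic images.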
In recent years, synthetic data has gained popularity across industries like BFSI, retail, and healthcare as it helps enterprises quickly validate new technologies in public cloud environments. When created with a sound understanding of data relationships, synthetic data can match the insights generated by real-world information while reducing business risks and overhead costs.