Synthetic Data Generation: A Comparative Study

This alert has been successfully added and will be sent to: You will be notified whenever a record that you have chosen has been cited.

To manage your alert preferences, click on the button below. Manage my Alerts

New Citation Alert!

Abstract

Generating synthetic data similar to realistic data is a crucial task in data augmentation and data production. Due to the preservation of authentic data distribution, synthetic data provide concealment of sensitive information and therefore enable Big Data acquisition for model training without facing privacy challenges. Nevertheless, the obstacles arise starting with acquiring real-world open-source data to effectively synthesizing new samples as genuine as possible. In this paper, a comparative study is conducted by considering the efficacy of different generative models like Generative Adversarial Networks (GAN), Variational Autoencoder (VAE), Synthetic Minority Oversampling Technique (SMOTE), Data Synthesizer (DS), Synthetic Data Vault with Gaussian Copula (SDV-G), Conditional Generative Adversarial Networks (SDV-GAN), and SynthPop Non-Parametric (SP-NP) approach to synthesize data with regard to various datasets. We used the pairwise correlation and Synthetic Data (SD) metrics as utility measures respectively between real data and generated data for evaluation. Accordingly, this paper investigates the effects of various data generation models, and the processing time of every model is included as one of the evaluation metrics.

Supplementary Material

Presentation slides (p94-venugopal-supplement.pptx)