Enhanced Regular Expression as a DGL for Generation of Synthetic Big Data


Kai Cheng, Keisuke Abe, Journal of Information Processing Systems Vol. 19, No. 1, pp. 1-16, Feb. 2023  

https://doi.org/10.3745/JIPS.04.0262
Keywords: Big Data Analytics, Data Generation Language (DGL), Performance Analysis, Regular Expression, Synthetic Data Generation, Type/Format Inference

Abstract

Synthetic data generation is widely used for performance evaluation and functional testing of data-intensive applications, as well as in areas of data analytics such as privacy-preserving data publishing (PPDP) and statistical disclosure limitation/control. A significant amount of research has been conducted on tools and languages for data generation. However, existing tools and languages were developed for specific purposes and are unsuitable for other domains. In this article, we propose a regular expression-based data generation language (DGL) for flexible big data generation. To make the DGL general-purpose and powerful, we enhance standard regular expressions to support data domains, type/format inference, sequence and random generation, probability distributions, and resource references. To implement the proposed language efficiently, we propose caching techniques for both intermediate results and database queries. We evaluated the proposed improvements experimentally.
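The core idea of the abstract, reading a regular expression as a recipe for producing strings instead of a test for matching them, can be illustrated with a minimal sketch. It is not the DGL syntax or implementation from the article: the supported subset (literals, character classes such as `[A-Z]`, `\d`, and `{n}` / `{m,n}` repetition), the function names `expand_class`, `parse`, and `generate`, and the sample pattern are all assumptions made for illustration. Memoising the parsed pattern with `lru_cache` is likewise this sketch's own simplification and should not be taken as the caching scheme evaluated in the article.

```python
import random
import re
import string
from functools import lru_cache


def expand_class(body):
    """Expand a character-class body such as 'A-Z0-9' into a tuple of characters."""
    chars = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # Range like A-Z: include every code point from start to end.
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return tuple(chars)


@lru_cache(maxsize=None)  # hypothetical cache: parse each pattern only once
def parse(pattern):
    """Turn a pattern into a tuple of (allowed_chars, min_count, max_count)."""
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            end = pattern.index("]", i)
            chars = expand_class(pattern[i + 1:end])
            i = end + 1
        elif ch == "\\" and pattern[i + 1:i + 2] == "d":
            chars = tuple(string.digits)
            i += 2
        else:
            chars = (ch,)  # any other character is treated as a literal
            i += 1
        lo = hi = 1
        if i < len(pattern) and pattern[i] == "{":
            end = pattern.index("}", i)
            parts = pattern[i + 1:end].split(",")
            lo, hi = int(parts[0]), int(parts[-1])
            i = end + 1
        tokens.append((chars, lo, hi))
    return tuple(tokens)


def generate(pattern, rng):
    """Produce one random string that the given pattern would match."""
    out = []
    for chars, lo, hi in parse(pattern):
        count = rng.randint(lo, hi)  # uniform choice of repetition count
        out.extend(rng.choice(chars) for _ in range(count))
    return "".join(out)


PATTERN = r"[A-Z]{2}-\d{4}[a-z]{1,3}"  # illustrative pattern, not from the article
rng = random.Random(42)  # fixed seed so runs are repeatable
samples = [generate(PATTERN, rng) for _ in range(5)]
```

Every generated string can be checked against the same pattern with `re.fullmatch`, which is a convenient sanity test for a generator of this kind. Characters and repetition counts are drawn uniformly here; supporting other probability distributions, sequences, or external resources, as the abstract describes, would require extending the token representation.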


Cite this article
[APA Style]
Cheng, K. & Abe, K. (2023). Enhanced Regular Expression as a DGL for Generation of Synthetic Big Data. Journal of Information Processing Systems, 19(1), 1-16. DOI: 10.3745/JIPS.04.0262.

[IEEE Style]
K. Cheng and K. Abe, "Enhanced Regular Expression as a DGL for Generation of Synthetic Big Data," Journal of Information Processing Systems, vol. 19, no. 1, pp. 1-16, 2023. DOI: 10.3745/JIPS.04.0262.

[ACM Style]
Kai Cheng and Keisuke Abe. 2023. Enhanced Regular Expression as a DGL for Generation of Synthetic Big Data. Journal of Information Processing Systems, 19, 1, (2023), 1-16. DOI: 10.3745/JIPS.04.0262.