synthesizer
Package version 0.4.0.
Use citation('synthesizer') to cite the package.
synthesizer is an R package for quickly and easily synthesizing data. It also provides a few basic functions, based on pMSE, to measure the utility of the synthesized data. The package supports numerical, categorical/ordinal, and mixed data, and correctly takes account of missing values and mixed distributions. A utility parameter lets you gradually shift between realistic data with high utility and less realistic data with decreased utility.
The method looks promising, but we are still investigating where it works well and where it fails, so there are no guarantees yet on utility, privacy, and so on. That said, our preliminary results are encouraging, and the package is very easy to use.
The latest CRAN release can be installed as follows.
install.packages("synthesizer")
Next, the package can be loaded. You can use packageVersion (from the utils package, which ships with R) to check which version you have installed.
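A minimal example of loading the package and checking its version, assuming it was installed from CRAN as described above:

```r
# Load the package (assumes it has been installed already).
library(synthesizer)

# Report the installed version.
packageVersion("synthesizer")
```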
We will use the iris dataset, which is built into R.
> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Creating a synthetic version of this dataset is easy.
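A minimal sketch of that step, using the same synthesize call that appears later in this document but with all arguments left at their defaults. The variable name synth_iris is chosen here for illustration.

```r
library(synthesizer)

# Synthesize a dataset from iris with default settings.
synth_iris <- synthesize(iris)

# Inspect the first few synthetic records.
head(synth_iris)
```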
To compare the datasets we can make some side-by-side scatterplots.
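One way to draw such a comparison with base R graphics is sketched below. The choice of variables and the plot layout are illustrative assumptions, not necessarily the figure the authors produced.

```r
library(synthesizer)

synth_iris <- synthesize(iris)

# Two panels side by side: real data on the left, synthetic on the right.
op <- par(mfrow = c(1, 2))
plot(Sepal.Length ~ Petal.Length, data = iris,
     col = as.integer(as.factor(iris$Species)), pch = 16, main = "Real")
plot(Sepal.Length ~ Petal.Length, data = synth_iris,
     col = as.integer(as.factor(synth_iris$Species)), pch = 16,
     main = "Synthetic")
par(op)  # restore the previous graphics settings
```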
By default, synthesize returns a dataset of the same size as the input dataset. However, it is possible to ask for any number of records.
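For example, the sketch below requests 250 records. It assumes the number of records is passed through an argument named n. That name is not confirmed by the text above, so check ?synthesize for the actual signature.

```r
library(synthesizer)

# Ask for 250 synthetic records instead of the default 150 for iris.
# Assumption: the argument is called `n`.
more_synth <- synthesize(iris, n = 250)
nrow(more_synth)
```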
The pMSE method is a popular way of measuring the quality of a synthetic dataset. The idea is to train a model to predict whether a record is synthetic or not. The worse the model performs at this task, the more closely the synthetic data resembles the real data. The value ranges from 0 to 0.25 (if the synthetic and real datasets have the same number of records). Smaller is better.
The package lets you choose between logistic regression (the default) and a random forest classifier as the predictive model.
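To make the idea concrete, here is a minimal base R sketch of a pMSE computation with logistic regression. It is an illustration and not the package's own implementation. It uses the common definition of pMSE as the mean of (p_i - c)^2, where p_i is the predicted probability that record i is synthetic and c is the share of synthetic records in the combined data. The "synthetic" data here is a simple stand-in (rows resampled from iris), so the sketch runs without the package.

```r
set.seed(1)

real <- iris
# Stand-in for synthetic data: rows of iris resampled with replacement.
synth <- iris[sample(nrow(iris), replace = TRUE), ]

# Stack both datasets and label each record (0 = real, 1 = synthetic).
combined <- rbind(real, synth)
combined$is_synth <- rep(c(0, 1), times = c(nrow(real), nrow(synth)))

# Logistic regression predicting whether a record is synthetic.
fit <- glm(is_synth ~ ., data = combined, family = binomial())
p_hat <- fitted(fit)

# Share of synthetic records: 0.5 when both datasets have equal size.
share <- nrow(synth) / nrow(combined)

# pMSE: mean squared distance between the predictions and that share.
pmse_value <- mean((p_hat - share)^2)
pmse_value
```

With equal-sized datasets the share is 0.5, so each squared term is at most 0.25, which matches the range given above.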
Synthetic data can be too realistic, in the sense that it might reveal actual properties of the real entities represented by the synthetic data. One way to mitigate this is to decorrelate the variables in the synthetic data. For data frames, this can be done with the utility parameter, either for all variables or for a selection of variables. Setting utility to 1 (the default) yields the most realistic data; lowering it causes loss of (linear or nonlinear) correlation between synthetic variables, if there was any in the real data.
> # decorrelate rank matching to 0.5
> s1 <- synthesize(iris, utility=0.5)
> # decorrelate only Species
> s2 <- synthesize(iris, utility=c("Species"=0.5))
In the left figure, we show three variables of a synthesized iris dataset, where all variables are decorrelated. Both the geometric clustering and the species are now garbled. In the right figure we only decorrelate the Species variable. Here, the spatial clustering is retained while the correlation between color (Species) and location is lost.
Synthetic data is prepared as follows.
Given an original dataset with n records:
If m < n records are needed, sample m records uniformly from the dataset just created. If m > n records are needed, create ⌈m/n⌉ synthetic datasets of size n and sample m records uniformly from the combined datasets.
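The resizing rule can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: draw_m_records and make_one are hypothetical names, make_one stands for any function that returns one synthetic dataset, and each such dataset is assumed to have n records. A simple row resample of iris is used as the stand-in generator so the sketch runs on its own.

```r
# Hypothetical helper illustrating the resizing rule.
# make_one: function returning one synthetic data frame with n records.
draw_m_records <- function(make_one, n, m) {
  if (m > n) {
    # Build ceiling(m / n) synthetic datasets and stack them.
    k <- ceiling(m / n)
    pool <- do.call(rbind, replicate(k, make_one(), simplify = FALSE))
  } else {
    pool <- make_one()
  }
  # Sample m records uniformly without replacement from the pool.
  # (For m == n this only permutes the rows.)
  pool[sample(nrow(pool), m), , drop = FALSE]
}

set.seed(1)
# Stand-in generator: resampled rows of iris, not real synthesis.
stand_in <- function() iris[sample(nrow(iris), replace = TRUE), ]

small <- draw_m_records(stand_in, n = nrow(iris), m = 100)
large <- draw_m_records(stand_in, n = nrow(iris), m = 400)
c(nrow(small), nrow(large))
```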