Introduction to synthesizer

Package version 0.4.0.

Use citation('synthesizer') to cite the package.

Introduction

synthesizer is an R package for quickly and easily synthesizing data. It also provides a few basic functions, based on pMSE, to measure the utility of the synthesized data.

The package supports numerical, categorical/ordinal, and mixed data, and it accounts for missing values and mixed distributions. A utility parameter lets you gradually shift between realistic data with high utility and less realistic data with decreased utility.

We are still investigating where the method works well and where it fails, so we cannot yet give guarantees on utility, privacy, or other properties. That said, our preliminary results are promising, and the package is very easy to use.

Installation

The latest CRAN release can be installed as follows.

install.packages("synthesizer")

Next, the package can be loaded. You can use packageVersion (from base R) to check which version you have installed.

> library(synthesizer)
> # check the package version
> packageVersion("synthesizer")
[1] ‘0.4.0’

A first example

We will use the iris dataset, which is built into R.

> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Creating a synthetic version of this dataset is easy.

> set.seed(1)
> synth_iris <- synthesize(iris)

To compare the datasets we can make some side-by-side scatterplots.
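
The plotting code is not shown here, so the following is only a minimal sketch of how such a comparison could be drawn with base R graphics. The helper `compare_plots()` and its arguments are hypothetical and not part of synthesizer; it assumes both data frames contain the two plotted variables and a grouping factor.

```r
# Hypothetical helper (not part of synthesizer): plot the same pair of
# variables for the real and the synthetic data in two adjacent panels.
compare_plots <- function(real, synth,
                          x = "Petal.Length", y = "Sepal.Length",
                          group = "Species") {
  oldpar <- par(mfrow = c(1, 2))      # two panels side by side
  on.exit(par(oldpar))                # restore graphics settings on exit
  # shared axis limits so both panels are directly comparable
  xlim <- range(real[[x]], synth[[x]], na.rm = TRUE)
  ylim <- range(real[[y]], synth[[y]], na.rm = TRUE)
  plot(real[[x]], real[[y]], col = as.integer(real[[group]]), pch = 16,
       xlim = xlim, ylim = ylim, xlab = x, ylab = y, main = "Original")
  plot(synth[[x]], synth[[y]], col = as.integer(synth[[group]]), pch = 16,
       xlim = xlim, ylim = ylim, xlab = x, ylab = y, main = "Synthesized")
  invisible(NULL)
}
```

With the objects created above, `compare_plots(iris, synth_iris)` would draw the two panels.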

Original and Synthesized Iris

By default synthesize will return a dataset of the same size as the input dataset. However, it is possible to ask for any number of records.

> more_synth <- synthesize(iris, n=250)
> dim(more_synth)
[1] 250   5

Checking quality

The pMSE method is a popular way of measuring the quality of a synthetic dataset. The idea is to train a model to predict whether a record is synthetic or real. The worse the model performs at this task, the better the synthetic data resembles the real data. The value ranges between 0 and 0.25 (if the synthetic and real datasets have the same number of records). Smaller is better.

> pmse(synth=synth_iris, real=iris)
[1] 0.007844863
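
To make the idea concrete, here is a minimal base R sketch of a pMSE computation with a main-effects logistic regression. It uses the common definition, the mean of (p_i - c)^2, where p_i is the predicted probability that record i is synthetic and c is the share of synthetic records in the combined data. The helper `pmse_sketch()` is hypothetical, and the package's own `pmse()` may differ in model specification and details.

```r
# Hypothetical helper (not the package's pmse()): propensity-score mean
# squared error with a main-effects logistic regression.
pmse_sketch <- function(synth, real) {
  combined <- rbind(real, synth)
  # label: 0 for real records, 1 for synthetic records
  combined$is_synth <- rep(c(0, 1), times = c(nrow(real), nrow(synth)))
  fit <- glm(is_synth ~ ., data = combined, family = binomial())
  p   <- fitted(fit)                       # estimated propensity scores
  cc  <- nrow(synth) / nrow(combined)      # share of synthetic records
  mean((p - cc)^2)
}

# Stand-in synthetic data for the demo: every column resampled independently.
set.seed(1)
naive <- as.data.frame(lapply(iris, sample, replace = TRUE))
pmse_sketch(synth = naive, real = iris)
```

With equally sized datasets c equals 0.5, so each squared term is at most 0.25, which is where the upper bound comes from.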

The package lets you choose between logistic regression (the default) and a random forest classifier as the predictive model.

> pmse(synth=synth_iris, real=iris, model="rf")
[1] 0.0921007

Choosing the utility-privacy trade-off

Synthetic data can be too realistic, in the sense that it might reveal actual properties of the real entities represented by the synthetic data. One way to mitigate this is to decorrelate the variables in the synthetic data. For data frames, this can be done with the utility parameter, either for all variables or for a selection of variables. Setting utility to 1 (the default) yields the most realistic data. Lowering the utility causes loss of (linear or nonlinear) correlation between synthetic variables, if there was any in the real data.

> # decorrelate all variables by lowering the utility to 0.5
> s1 <- synthesize(iris, utility=0.5)
> # decorrelate only Species
> s2 <- synthesize(iris, utility=c("Species"=0.5))

Two versions of synthetic iris

In the left figure, we show three variables of a synthesized iris dataset in which all variables are decorrelated. Both the geometric clustering and the species are now scrambled. In the right figure, only the Species variable is decorrelated. Here, the spatial clustering is retained while the correlation between color (Species) and location is lost.

How it works

Synthetic data is prepared as follows.

Given an original dataset with n records:

  1. For each numeric variable in the dataset, determine the inverse of the empirical cumulative distribution function (ECDF), using linear interpolation between the data points. The observed minimum and maximum are also the minimum and maximum of the synthetic univariate distribution. Sample n values using inverse transform sampling with the linearly interpolated inverse ECDF. Missing values are taken into account by sampling them in proportion to their occurrence.
  2. For each categorical or logical variable, sample n values with replacement.
  3. Reorder the synthetic dataset such that the rank order combinations of the synthetic data match those of the original dataset. If the utility of any variable is less than one, first randomly permute its ranks until the correlation between the original and the permuted ranks drops below the specified value.
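
The three steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: all helper names are hypothetical, the interpolation grid for the inverse ECDF is one of several possible choices, ties in ranks are broken by position, and decorrelation is done by swapping random pairs of ranks.

```r
# Hypothetical helpers for illustration; none are exported by synthesizer.

# Step 1: inverse transform sampling from a linearly interpolated inverse
# ECDF. Assumes at least two non-missing values.
sample_numeric <- function(x, n = length(x)) {
  obs  <- sort(x[!is.na(x)])
  grid <- seq(0, 1, length.out = length(obs))  # probability 0 -> min, 1 -> max
  out  <- approx(x = grid, y = obs, xout = runif(n))$y
  out[runif(n) < mean(is.na(x))] <- NA         # missing values in proportion
  out
}

# Step 2: categorical or logical variables are resampled with replacement.
sample_categorical <- function(x, n = length(x)) {
  x[sample.int(length(x), n, replace = TRUE)]
}

# Step 3, utility < 1: swap random pairs of ranks until the correlation with
# the original ranks drops below the requested value (assumed mechanism).
decorrelate_ranks <- function(r, utility) {
  out <- r
  while (utility < 1 && cor(r, out) >= utility) {
    ij <- sample.int(length(out), 2)
    out[ij] <- out[rev(ij)]
  }
  out
}

# Step 3: give the synthetic values the same rank order as the real column.
match_ranks <- function(synth, real, utility = 1) {
  r <- order(order(real))                      # ranks, ties broken by position
  sort(synth, na.last = TRUE)[decorrelate_ranks(r, utility)]
}

synthesize_sketch <- function(df, utility = 1) {
  out <- lapply(df, function(col) {
    s <- if (is.numeric(col)) sample_numeric(col) else sample_categorical(col)
    match_ranks(s, col, utility)
  })
  as.data.frame(out)
}

set.seed(1)
sketch <- synthesize_sketch(iris)
# Spearman correlation between a real column and its synthetic counterpart
cor(iris$Sepal.Length, sketch$Sepal.Length, method = "spearman")
```

Because every synthetic column inherits the rank order of its real counterpart, the dependence between columns carries over without being modelled explicitly. Lowering `utility` in this sketch loosens that link one variable at a time.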

If m < n records are needed, sample m records uniformly from the dataset just created. If m > n records are needed, create ceiling(m/n) synthetic datasets of size n and sample m records uniformly from the combined datasets.
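
A minimal sketch of this resizing step, assuming that ceiling(m/n) datasets of n records each are pooled and that the m records are drawn without replacement. The helper `resize_synthetic()` and its `make_one` argument are hypothetical; `make_one` stands for any function that returns one synthetic dataset of n records, and a plain row bootstrap is used here only as a stand-in.

```r
# Hypothetical helper: pool enough synthetic datasets of n records each,
# then draw m records uniformly without replacement.
resize_synthetic <- function(make_one, n, m) {
  k    <- ceiling(m / n)                       # number of datasets to pool
  pool <- do.call(rbind, replicate(k, make_one(), simplify = FALSE))
  pool[sample.int(nrow(pool), m), , drop = FALSE]
}

# Stand-in generator for the demo (assumption): bootstrap the rows of iris.
make_one <- function() iris[sample.int(nrow(iris), replace = TRUE), ]

set.seed(1)
bigger  <- resize_synthetic(make_one, n = nrow(iris), m = 250)
smaller <- resize_synthetic(make_one, n = nrow(iris), m = 40)
dim(bigger)    # 250 rows, 5 columns
dim(smaller)   # 40 rows, 5 columns
```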