accumulate
Package version 0.9.3.
Use citation('accumulate')
to cite the package.
Accumulate
is a package for grouped aggregation, where
the groups can be dynamically collapsed into larger groups. When this
collapsing takes place and how collapsing takes place is
user-defined.
The latest CRAN release can be installed as follows.
install.packages("accumulate")
Next, the package can be loaded. You can use
packageVersion
(from base R) to check which version you
have installed.
We will use a built-in dataset as example.
> data(producers)
> head(producers)
sbi size industrial trade other other_income total
1 3410 8 151722 2135 0 -1775 152082
2 2840 7 50816 NA 158 949 59876
3 2752 5 4336 NA 0 36 4959
4 3120 6 18508 NA 0 80 20682
5 2524 7 21071 0 0 442 21513
6 3410 6 24220 1069 0 239 25528
This synthetic dataset contains information on various sources of
turnover from producers, that are labeled with an economic activity
classification (sbi
) and a size
class
(0-9).
We wish to find a group mean by sbi x size
. However, we
demand that the group has at least five records, otherwise we combine
the size classes of a single sbi
group. This can be done as
follows.
> a <- accumulate(producers
+ , collapse = sbi*size ~ sbi
+ , test = min_records(5)
+ , fun = mean, na.rm=TRUE)
> head(round(a))
sbi size level industrial trade other other_income total
1 3410 8 1 364397 2859 33 353 546117
2 2840 7 0 23160 823 49 329 25812
3 2752 5 NA NA NA NA NA NA
4 3120 6 0 20710 504 112 200 21702
5 2524 7 0 27954 1268 55 456 30468
6 3410 6 1 364397 2859 33 353 546117
The accumulate function does the following:
sbi
and size
occurring in the data, it checks whether test
is satisfied.
Here, it tests whether there are at least five records.
level
is set to 0
(no collapsing took place).sbi
as grouping variable for the current combination of
sbi
and size
. Then, if there are enough
records, the mean is computed for each variable and the output variable
level
is set to 1 (first level of collapsing has been
used).NA
for the current sbi
and
size
combination.Explicitly, for this example we see that for
(sbi,size)==(2752,5)
no satisfactory group of records was
found under the current collapsing scheme. Therefore the
level
variable equals NA
and all aggregated
variables are missing as well. For (sbi,size)==(2840,7)
there are sufficient records, and since level=0
no
collapsing was necessary. For the group (sbi,size)=(3410,8)
there were not enough records to compute a mean, but taking all records
in sbi==3410
gave enough records. This is signified by
level=1
, meaning that one collapsing step has taken place
(from sbi x size
to sbi
).
Let us see how we specified this call to accumulate
target groups ~ collapsing scheme
. The output is always at
the level of the target groups. The collapsing scheme determines which
records are used to compute a value for the target groups if the
test
is not satisfied.test
is a function that
should accept any subset of records of producers
and return
TRUE
or FALSE
. In this case we used the
convenience function min_records(5)
provided by
accumulate
. The function min_records()
creates
a testing function for us that we can pass as testing function.fun
is the aggregation function
that will be applied to each group.Observe that the accumulate function is similar to R’s built-in
aggregate
function (this is by design). There is a second
function called cumulate
that has an interface that is
similar to dplyr::summarise
.
> a <- cumulate(producers, collapse = sbi*size ~ sbi
+ , test = function(d) nrow(d) >= 5
+ , mu_industrial = mean(industrial, na.rm=TRUE)
+ , sd_industrial = sd(industrial, na.rm=TRUE))
> head(round(a))
sbi size level mu_industrial sd_industrial
1 3410 8 1 364397 535446
2 2840 7 0 23160 13937
3 2752 5 NA NA NA
4 3120 6 0 20710 21151
5 2524 7 0 27954 15089
6 3410 6 1 364397 535446
Notice that here, we wrote our own test function.
(sbi, size)
could not be
computed, even when collapsing to sbi
? (You need to run the
code and investigate the output).?mean
on how to compute
trimmed means.A collapsing scheme can be defined in a data frame or with a formula of the form
target grouping ~ collapse1 + collapse2 + ... + collapseN
Here, the target grouping
is a variable or product of
variables. Each collapse
term is also a variable or product
of variables. Each subsequent term defines the next collapsing step. Let
us show the idea with a more involved example.
The sbi
variable in the producers
dataset
encodes a hierarchical classification where longer digit sequences
indicate higher level of detail. Hence we can collapse to lower levels
of detail by deleting digits at the end. Let us enrich the
producers
dataset with extra grouping levels.
> producers$sbi3 <- substr(producers$sbi,1,3)
> producers$sbi2 <- substr(producers$sbi,1,2)
> head(producers,3)
sbi size industrial trade other other_income total sbi3 sbi2
1 3410 8 151722 2135 0 -1775 152082 341 34
2 2840 7 50816 NA 158 949 59876 284 28
3 2752 5 4336 NA 0 36 4959 275 27
We can now use a more involved collapsing scheme as follows.
> a <- accumulate(producers, collapse = sbi*size ~ sbi + sbi3 + sbi2
+ , test = min_records(5), fun = mean, na.rm=TRUE)
> head(round(a))
sbi size level industrial trade other other_income total
1 3410 8 1 364397 2859 33 353 546117
2 2840 7 0 23160 823 49 329 25812
3 2752 5 2 19526 39 52 151 20603
4 3120 6 0 20710 504 112 200 21702
5 2524 7 0 27954 1268 55 456 30468
6 3410 6 1 364397 2859 33 353 546117
For (sbi,size) == (2752,5)
we have 2 levels of
collapsing. In other words, for that aggregate, all records in
sbi3 == 275
were used.
trade
and
total
using the cumulate
function under the
same collapsing scheme as defined above.(sbi,size)
have been
collapsed to level 0, 1, 2, or 3. Tabulate them.sbi
code and compute the means of all variables.Collapsing schemes can be represented in data frames that have the form
[target group, parent of target group, parent of parent of target group,...].
The package comes with a helper function that creates such a scheme from hierarchical classifications that are encoded as digits.
For the sbi
example we can do the following to derive a
collapsing scheme.
> sbi <- unique(producers$sbi)
> csh <- csh_from_digits(sbi)
> names(csh)[1] <- "sbi"
> head(csh)
sbi A1 A2 A3 A4
1 3410 3410 341 34 3
2 2840 2840 284 28 2
3 2752 2752 275 27 2
4 3120 3120 312 31 3
5 2524 2524 252 25 2
6 2875 2875 287 28 2
Here, the column sbi
denotes the original (maximally)
5-digit codes, A1
the 4-digit codes, and so on. It is
important that the name of the first column matches a column in the data
to be agregated. Both cumlate
and accumulate
accept such a data frame as an argument. Here is an example with
cumulate
.
> a <- cumulate(producers, collapse = csh, test = function(d) nrow(d) >= 5
+ , mu_total = mean(total, na.rm=TRUE)
+ , sd_total = sd(total, na.rm=TRUE))
> head(a)
sbi level mu_total sd_total
1 3410 0 546117.22 844001.47
2 2840 0 31265.28 35053.37
3 2752 2 20603.08 31286.51
4 3120 0 26548.61 26784.60
5 2524 0 23434.68 18022.32
7 2875 0 15962.24 9640.14
In this representation is is not possible to use multiple grouping variables, unless you combine multiple grouping variables into a single one, for example by pasting them together.
The advantage of this representation is that it allows users to externally define a (manually edited) collapsing scheme.
csh
to compute the median of all numerical
variables of the producers
dataset with
accumulate
(hint: you need to remove the size
variable).There are several options to define test on groups of records:
min_records()
, min_complete()
, or
frac_complete()
.from_validator()
function.Let us look at a small example for each case. For comparison we will always test that there are a minimum of five records.
> # load the data again to loose columns 'sbi2' and 'sbi3' and work
> # with the original data.
> data(producers)
> # 1. using a helper function
> a <- accumulate(producers, collapse = sbi*size ~ sbi
+ , test = min_records(5)
+ , fun = mean, na.rm=TRUE)
> # 2. using a 'validator' object
> rules <- validate::validator(nrow(.) >= 5)
> a <- accumulate(producers, collapse = sbi*size ~ sbi
+ , test = from_validator(rules)
+ , fun = mean, na.rm=TRUE)
> # 3. using a custom function
> a <- accumulate(producers, collapse=sbi*size ~ sbi
+ , test = function(d) nrow(d) >= 5
+ , fun = mean, na.rm=TRUE)
An aggregate may be something more complex than a scalar. The
accumulate
package also supports complex aggregates such as
linear models.
> a <- cumulate(producers, collapse = sbi*size ~ sbi
+ , test = min_complete(5, c("other_income","trade"))
+ , model = lm(other_income ~ trade)
+ , mean_other = mean(other_income, na.rm=TRUE))
> head(a)
sbi size level model mean_other
1 3410 8 NA <logical> NA
2 2840 7 1 <lm> 249.3333
3 2752 5 NA <logical> NA
4 3120 6 0 <lm> 199.8889
5 2524 7 0 <lm> 456.2500
6 3410 6 NA <logical> NA
Here, we demand that there are at least five records available for estimating the model.
The linear models are stored in a list
of type
object_list
. Subsets or individual elements can be accessed
as usual with data frames.
> a$model[[1]]
[1] NA
> a$model[[2]]
Call:
lm(formula = other_income ~ trade)
Coefficients:
(Intercept) trade
221.59429 0.06937
If you write your own test function from scratch, it is easy to
overlook some edge cases like the occurrence of missing data, a column
that is completely NA
, or receiving zero records. The
function smoke_test()
accepts a data set and a test
function and runs the test function on several common edge cases based
on the dataset. It does not check whether the test function
works as expected, but it checks that the output is TRUE
or
FALSE
in all cases and reports errors, warnings and mesages
if they occur.
As an example we construct a test function that checks whether one of the variables has sufficient non-zero values.
> my_test <- function(d) sum(other != 0) > 3
> smoke_test(producers, my_test)
Test with full dataset raised issues.
ERR: object 'other' not found
Oops, we forgot to refer to the data set. Let’s try it again.
> my_test <- function(d) sum(d$other != 0) > 3
> smoke_test(producers, my_test)
Test with full dataset raised issues.
NA detected in output (must be TRUE or FALSE)
Test with first record and other is NA raised issues.
NA detected in output (must be TRUE or FALSE)
Test with first record and all values NA raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and sbi is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and size is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and industrial is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and trade is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and other is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and other_income is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Test with full dataset and total is NA for all records raised issues.
NA detected in output (must be TRUE or FALSE)
Our function is not robust against occurrence of NA
.
Here’s a third attempt.
sbi*size ~ sbi1 + sbi2
as collapsing scheme. Make sure
there are at least 10 records in each group.industrial
and
total
, but demand that there are not more than 20% zeros in
other
. Use csh
as collapsing scheme.