R packages by markvanderloo

stringdist - Approximate String Matching, Fuzzy Text Search, and String Distance Functions

Implements an approximate string matching version of R's native 'match' function. Also offers fuzzy text search based on various string distance measures. Can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding or between integer vectors representing generic sequences. This package is built for speed and runs in parallel by using 'OpenMP'. An API for C or C++ is exposed as well. Reference: MPJ van der Loo (2014) <doi:10.32614/RJ-2014-011>.

Last updated 4 months ago

openmp

15.54 score 327 stars 179 dependents 2.0k scripts 67k downloads
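
As an illustration of the edit-based distances named in the stringdist description, here is a minimal Python sketch of the Levenshtein distance. It shows the metric itself, not the package's own implementation; the function name `levenshtein` is a hypothetical helper chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b (illustrative sketch)."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

The sketch keeps only two rows of the dynamic-programming table, so memory use grows with the length of the second string rather than with the product of both lengths.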

tinytest - Lightweight and Feature Complete Unit Testing Framework

Provides a lightweight (zero-dependency) and easy-to-use unit testing framework. Main features: install tests with the package. Test results are treated as data that can be stored and manipulated. Test files are R scripts interspersed with test commands that can be programmed over. Fully automated build-install-test sequence for packages. Skip tests when not run locally (e.g. on CRAN). Flexible and configurable output printing. Compare computed output with output stored with the package. Run tests in parallel. Extensible by other packages. Report side effects.

Last updated 3 months ago

12.51 score 228 stars 7 dependents 574 scripts 23k downloads

validate - Data Validation Infrastructure

Declare data validation rules and data quality indicators; confront data with them and analyze or visualize the results. The package supports rules that are per-field, in-record, cross-record or cross-dataset. Rules can be automatically analyzed for rule type and connectivity. Supports checks implied by an SDMX DSD file as well. See also Van der Loo and De Jonge (2018) <doi:10.1002/9781118897126>, Chapter 6 and the JSS paper (2021) <doi:10.18637/jss.v097.i10>.

Last updated 25 days ago

data-cleaning validation

12.39 score 419 stars 8 dependents 448 scripts 2.8k downloads

gower - Gower's Distance

Compute Gower's distance (or similarity) coefficient between records. Compute the top-n matches between records. Core algorithms are executed in parallel on systems supporting OpenMP.

Last updated 10 months ago

openmp

11.19 score 29 stars 391 dependents 66 scripts 139k downloads
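
To make the idea behind Gower's distance concrete, the following Python sketch computes it for two records with mixed numeric and categorical fields. It is a simplified illustration under stated assumptions, not the gower package's algorithm: the function name `gower_distance` and the `ranges` argument are hypothetical, every field is weighted equally, missing values are not handled, and each numeric range is assumed to be positive.

```python
def gower_distance(x: dict, y: dict, ranges: dict) -> float:
    """Average per-field dissimilarity between two records (illustrative sketch).

    Fields listed in `ranges` are treated as numeric and contribute
    |x - y| / range; all other fields are treated as categorical and
    contribute 0 when equal and 1 otherwise.
    """
    total = 0.0
    for key in x:
        if key in ranges:
            # Numeric field: absolute difference scaled by the field's range.
            total += abs(x[key] - y[key]) / ranges[key]
        else:
            # Categorical field: simple mismatch indicator.
            total += 0.0 if x[key] == y[key] else 1.0
    return total / len(x)


a = {"age": 30, "city": "Utrecht"}
b = {"age": 40, "city": "Leiden"}
print(gower_distance(a, b, ranges={"age": 50}))
```

Scaling numeric differences by the range keeps every field's contribution between 0 and 1, which is what allows numeric and categorical fields to be averaged together.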

settings - Software Option Settings Manager for R

Provides option settings management that goes beyond R's default 'options' function. With this package, users can define their own option settings manager holding option names, default values and (if so desired) ranges or sets of allowed option values that will be automatically checked. Settings can then be retrieved, altered and reset to defaults with ease. For R programmers and package developers it offers cloning and merging functionality which allows for conveniently defining global and local options, possibly in a multilevel options hierarchy. See the package vignette for some examples concerning functions, S4 classes, and reference classes. There are convenience functions to reset par() and options() to their 'factory defaults'.

Last updated 10 months ago

9.32 score 7 stars 36 dependents 1.0k scripts 2.7k downloads

simputation - Simple Imputation

Easy-to-use interfaces to a number of imputation methods that fit in the not-a-pipe operator of the 'magrittr' package.

Last updated 8 months ago

data-science imputation officialstatistics

8.42 score 91 stars 350 scripts 1.4k downloads

lumberjack - Track Changes in Data

A framework that allows for easy logging of changes in data. Main features: start tracking changes by adding a single line of code to an existing script. Track changes in multiple datasets, using multiple loggers. Add custom-built loggers or use loggers offered by other packages. <doi:10.18637/jss.v098.i01>.

Last updated 10 months ago

daff datascience logging reproducible-research

7.13 score 66 stars 1 dependents 68 scripts 467 downloads

dcmodify - Modify Data Using Externally Defined Modification Rules

Data cleaning scripts typically contain a lot of 'if this change that' type of statements. Such statements are typically condensed expert knowledge. With this package, such 'data modifying rules' are taken out of the code and instead become parameters to the workflow. This allows one to maintain, document, and reason about data modification rules as separate entities.

Last updated 10 months ago

6.24 score 10 stars 58 scripts 390 downloads

accumulate - Split-Apply-Combine with Dynamic Groups

Estimate group aggregates, where one can set user-defined conditions that each group of records must satisfy to be suitable for aggregation. If a group of records is not suitable, it is expanded using a collapsing scheme defined by the user. A paper on this package was published in the Journal of Statistical Software <doi:10.18637/jss.v112.i04>.

Last updated 11 days ago

5.35 score 9 stars 3 scripts 335 downloads

lintools - Manipulation of Linear Systems of (in)Equalities

Variable elimination (Gaussian elimination, Fourier-Motzkin elimination), Moore-Penrose pseudoinverse, reduction to reduced row echelon form, value substitution, projecting a vector on the convex polytope described by a system of (in)equations, simplifying systems by removing spurious columns and rows and collapsing implied equalities, testing if a matrix is totally unimodular, and computing variable ranges implied by linear (in)equalities.

Last updated 10 months ago

5.19 score 4 stars 2 dependents 13 scripts 490 downloads

synthesizer - Fast, Robust, and High-Quality Synthetic Data Generation with a Tuneable Privacy-Utility Trade-Off

Synthesize numeric, categorical, mixed and time series data. Data circumstances including mixed (or zero-inflated) distributions and missing data patterns are reproduced in the synthetic data. A single parameter allows balancing between high-quality synthetic data that represents correlations of the original data and lower-quality but more privacy-safe synthetic data without correlations. Tuning can be done per variable or for the whole dataset.

Last updated 25 days ago

4.60 score 8 scripts 317 downloads

deductive - Data Correction and Imputation Using Deductive Methods

Attempt to repair inconsistencies and missing values in data records by using information from valid values and validation rules restricting the data.

Last updated 2 months ago

data-cleaning

4.26 score 14 stars 13 scripts 589 downloads

deducorrect - Deductive Correction, Deductive Imputation, and Deterministic Correction

A collection of methods for automated data cleaning where all actions are logged. NOTE: active development has moved to the 'deductive' package.

Last updated 10 months ago

4.18 score 9 stars 34 scripts 844 downloads

hashr - Hash R Objects to Integers Fast

Apply an adaptation of the SuperFastHash algorithm to any R object. Hash whole R objects or, for vectors or lists, hash R objects to obtain a set of hash values that is stored in a structure equivalent to the input. See <http://www.azillionmonkeys.com/qed/hash.html> for a description of the hash algorithm.

Last updated 10 months ago

openmp

3.88 score 8 stars 19 scripts 213 downloads

rspa - Adapt Numerical Records to Fit (in)Equality Restrictions

Minimally adjust the values of numerical records in a data.frame, such that each record satisfies a predefined set of equality and/or inequality constraints. The constraints can be defined using the 'validate' package. The core algorithms have recently been moved to the 'lintools' package, refer to 'lintools' for a more basic interface and access to a version of the algorithm that works with sparse matrices.

Last updated 10 months ago

3.45 score 3 stars 19 scripts 356 downloads

extremevalues - Univariate Outlier Detection

Detect outliers in one-dimensional data.

Last updated 3 months ago

3.24 score 2 dependents 29 scripts 537 downloads