Title: | Univariate Outlier Detection |
---|---|
Description: | Detect outliers in one-dimensional data. |
Authors: | Mark van der Loo [cre, aut] |
Maintainer: | Mark van der Loo <[email protected]> |
License: | GPL-2 |
Version: | 2.4.1 |
Built: | 2025-01-16 05:53:49 UTC |
Source: | https://github.com/cran/extremevalues |
This package offers outlier detection and plot functions for univariate data.
The package is the implementation of the outlier detection methods introduced in the reference below. Briefly, the methods work as follows. Using a subset of the data, the parameters for a model distribution are estimated using regression of the sorted data on their QQ-plot positions.
A value in the data is an outlier when it is unlikely to be drawn from the
estimated distribution. There are two methods to determine the "unlikeliness".
The first, called "Method I", determines the value above which fewer than $\rho$
observations are expected, given the total number of observations in
the data. Here $\rho$ is a parameter which should have a value of 1 or
less. The second, called "Method II", uses the fit residuals. Extremely
large or small values are outliers when their residuals are above
or below a confidence limit $\alpha$, to be determined by the user.
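The Method I criterion can be written out as a short worked equation. This is a sketch based only on the description above, not a quotation from the reference: the symbols $N$ (number of observations), $F$ (fitted cumulative distribution function) and $\ell_{-}$, $\ell_{+}$ (left and right limits) are notation assumed here, with $\rho_{-}$ and $\rho_{+}$ standing for the two elements of the rho parameter.

```latex
% Expected number of observations beyond a limit equals rho:
N \, F(\ell_{-}) = \rho_{-}
  \quad\Longrightarrow\quad
  \ell_{-} = F^{-1}\!\left(\frac{\rho_{-}}{N}\right),
\qquad
N \, \bigl[1 - F(\ell_{+})\bigr] = \rho_{+}
  \quad\Longrightarrow\quad
  \ell_{+} = F^{-1}\!\left(1 - \frac{\rho_{+}}{N}\right).

% Example: N = 1000, rho_+ = 1, standard normal model:
\ell_{+} = \Phi^{-1}(0.999) \approx 3.09
```

Under these assumptions the limits move outward as $N$ grows, so a value that is suspicious in a small sample may be unremarkable in a large one.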
M.P.J. van der Loo, Distribution based outlier detection for univariate data. Discussion paper 10003, Statistics Netherlands, The Hague (2010). Available from www.markvanderloo.eu or www.cbs.nl.
getOutliers is a wrapper function for getOutliersI and getOutliersII.
getOutliers(y, method="I", ...)
getOutliersI(y, rho=c(1,1), FLim=c(0.1,0.9), distribution="normal")
getOutliersII(y, alpha=c(0.05, 0.05), FLim=c(0.1, 0.9), distribution="normal", returnResiduals=TRUE)
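y | Vector of one-dimensional nonnegative data |
method | "I" or "II" |
... | Optional arguments to be passed to getOutliersI or getOutliersII |
distribution | Model distribution used to estimate the limit. Choose from "lognormal", "exponential", "pareto", "weibull" or "normal" (default). |
FLim | c(Fmin,Fmax) quantile limits indicating which data should be used to fit the model distribution. Must obey 0 < Fmin < Fmax < 1. |
rho | (Method I) A value y is an outlier if it lies below (above) the limit where, according to the model distribution, fewer than rho[1] (rho[2]) observations are expected. |
alpha | (Method II) A value y is an outlier if its fit residual lies below (above) the alpha[1] (alpha[2]) confidence limit while all lower (higher) observations are outliers too. |
returnResiduals | (Method II) Whether or not to return a vector of residuals from the fit |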
Both methods use the subset of y-values between the Fmin and Fmax quantiles
to fit a model cumulative distribution function. Method I detects outliers by checking
which values are below (above) the limit where, according to the model distribution, fewer than
rho[1] (rho[2]) observations are expected (given length(y) observations). Method II
detects outliers by finding the observations (not used in the fit) whose fit residuals are
below (above) the estimated confidence limit alpha[1] (alpha[2]) while all lower (higher)
observations are outliers too.
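The Method I steps described above can be sketched in base R for the normal model. This is an illustration under stated assumptions, not the package's implementation: the helper name `method_one_sketch` is hypothetical, the plotting positions `i/(N+1)` are an assumed convention, and the limits are taken as the quantiles at which the fitted model expects `rho` observations.

```r
# Hypothetical helper: base-R sketch of Method I for a normal model.
method_one_sketch <- function(y, rho = c(1, 1), FLim = c(0.1, 0.9)) {
  N  <- length(y)
  ys <- sort(y)
  p  <- seq_len(N) / (N + 1)             # assumed QQ plotting positions
  use <- p >= FLim[1] & p <= FLim[2]     # central subset used for the fit
  # Regress sorted data on model quantiles: intercept ~ mu, slope ~ sigma
  fit   <- lm(ys[use] ~ qnorm(p[use]))
  mu    <- unname(coef(fit)[1])
  sigma <- unname(coef(fit)[2])
  # Limits beyond which fewer than rho observations are expected
  limit <- qnorm(c(rho[1] / N, 1 - rho[2] / N), mean = mu, sd = sigma)
  list(mu = mu, sigma = sigma, limit = limit,
       iLeft = which(y < limit[1]), iRight = which(y > limit[2]))
}

set.seed(1)                 # arbitrary seed for reproducibility
y   <- c(rnorm(100), 15)    # 100 regular values plus one planted extreme
res <- method_one_sketch(y)
res$limit                   # left and right limits
res$iRight                  # indices flagged on the right
```

Because only the central quantiles enter the regression, the planted extreme value does not influence the estimated parameters, which is the point of fitting on a subset.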
nOut | Number of left and right outliers. |
iLeft | Index vector indicating left outliers in y |
iRight | Index vector indicating right outliers in y |
limit | For Method I: y-values below (above) limit[1] (limit[2]) are outliers. For Method II: elements with residuals below (above) limit[1] (limit[2]) are outliers if all smaller (larger) elements are outliers as well. |
method | The method used: "method I" or "method II" |
distribution | The model distribution used |
Fmin | FLim[1] |
Fmax | FLim[2] |
yMin | Smallest y-value used in fit |
yMax | Largest y-value used in fit |
Nfit | Number of values used in the fit |
rho | Method I, the input rho-values for left and right outliers |
alphaConf | Method II, the input confidence levels for left and right outliers |
R2 | R-squared value for the fit. Note that this is the ordinary least squares value. |
lambda | (exponential distribution) Estimated location (and spread) parameter for the exponential distribution |
mu | (lognormal distribution) Estimated mean of log(y) |
sigma | (lognormal distribution) Estimated standard deviation of log(y) |
ym | (pareto distribution) Estimated location parameter (mode) for pareto distribution |
alpha | (pareto distribution) Estimated spread parameter for pareto distribution |
k | (weibull distribution) Estimated shape parameter |
lambda | (weibull distribution) Estimated scale parameter |
mu | (normal distribution) Estimated mean of y |
sigma | (normal distribution) Estimated standard deviation of y |
Mark van der Loo, see www.markvanderloo.eu
M.P.J. van der Loo, Distribution based outlier detection for univariate data. Discussion paper 10003, Statistics Netherlands, The Hague (2010). Available from www.markvanderloo.eu or www.cbs.nl.
The file <your R directory>/R-<version>/library/extremevalues/extremevalues.pdf contains a worked example. It can also be downloaded from my website.
# Simulate lognormal data and plant one low and one high extreme value
y <- rlnorm(100)
y <- c(0.1*min(y),y,10*max(y))
# Detect outliers with both methods
K <- getOutliers(y,method="I",distribution="lognormal")
L <- getOutliers(y,method="II",distribution="lognormal")
# Show the QQ plot (Method I) next to the residual plot (Method II)
par(mfrow=c(1,2))
outlierPlot(y,K,mode="qq")
outlierPlot(y,L,mode="residual")
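As a follow-up, the returned list can be inspected directly. The sketch below uses only the components listed in the value description above; the seed is an arbitrary choice added for reproducibility, and the exact numbers depend on the random draw.

```r
library(extremevalues)

set.seed(42)                           # arbitrary seed, not part of the original example
y <- rlnorm(100)
y <- c(0.1 * min(y), y, 10 * max(y))   # plant one low and one high extreme value

K <- getOutliers(y, method = "I", distribution = "lognormal")

K$nOut                   # number of left and right outliers
K$limit                  # y-values below limit[1] or above limit[2] are outliers
y[K$iLeft]               # the left outliers themselves
y[K$iRight]              # the right outliers themselves
c(K$mu, K$sigma, K$R2)   # fitted lognormal parameters and R-squared of the fit
```

Indexing `y` with `iLeft` and `iRight` is usually more convenient than comparing against `limit` by hand, since the same code then works for Method II results as well.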
Inverse error function
invErf(x)
x | (Vector of) real value(s) in the range (-1,1) |
(vector of) value(s) of the inverse error function
Mark van der Loo, www.markvanderloo.eu
# Plot the inverse error function over most of its domain
x <- seq(-0.99, 0.99, 0.01)
plot(x, invErf(x), type="l")
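For readers who want to see what the inverse error function computes, it can be expressed through the normal quantile function using the standard identity erf(z) = 2*pnorm(z*sqrt(2)) - 1. The base-R sketch below is a reference construction, not the package's code; `erf_ref` and `inv_erf_ref` are hypothetical names.

```r
# Error function written in terms of the normal CDF (standard identity).
erf_ref <- function(z) 2 * pnorm(z * sqrt(2)) - 1

# Hypothetical reference version of the inverse error function,
# obtained by solving the identity above for z.
inv_erf_ref <- function(x) qnorm((1 + x) / 2) / sqrt(2)

x <- seq(-0.99, 0.99, 0.01)
round_trip_error <- max(abs(erf_ref(inv_erf_ref(x)) - x))
round_trip_error   # should be close to machine precision
```

With the package loaded, one could compare `inv_erf_ref(x)` against `invErf(x)` on the same grid as a sanity check.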
This is a wrapper for two plot functions which can be used to analyse the results of outlier detection with the extremevalues package.
outlierPlot(y, L, mode="qq", ...)
qqFitPlot(y, L, title=NA, xlab=NA, ylab=NA, fat=FALSE)
plotMethodII(y, L, title=NA, xlab=NA, ylab=NA, fat=FALSE)
y | A vector of values |
L | The result of L <- getOutliers(y,...) |
mode | Plot type. "qq" for Quantile-quantile plot with indicated outliers, "residual" for plot of fit residuals with indicated outliers (Method II only) |
... | Optional arguments, to be transferred to qqFitPlot or plotMethodII (see below) |
title | A custom title (must be a string) |
xlab | A custom label for the x-axis (must be a string) |
ylab | A custom label for the y-axis (must be a string) |
fat | If TRUE, axis, fonts, labels, points and lines are thicker for export and publication |
Outliers are marked with a color or special symbol. If mode="qq",
observed against predicted y-values are plotted. Points between vertical lines
were used in the fit. If L$method="Method I", horizontal lines indicate the
limits below (above) which observations are outliers. mode="residual"
only works when L$method="Method II". It generates a residual plot where
points between two vertical lines were used in the fit. Horizontal lines
indicate the computed confidence limits. The outermost points in the gray areas
are outliers.
Mark van der Loo, www.markvanderloo.eu
The file <your R directory>/R-<version>/library/extremevalues/extremevalues.pdf contains a worked example. It can also be downloaded from my website.
# Simulate lognormal data and plant one low and one high extreme value
y <- rlnorm(100)
y <- c(0.1*min(y),y,10*max(y))
# Detect outliers with both methods
K <- getOutliers(y,method="I",distribution="lognormal")
L <- getOutliers(y,method="II",distribution="lognormal")
# Show the QQ plot (Method I) next to the residual plot (Method II)
par(mfrow=c(1,2))
outlierPlot(y,K,mode="qq")
outlierPlot(y,L,mode="residual")
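The optional arguments described above can be passed through outlierPlot when preparing figures for export. The following is a sketch only: the output path, the titles and the seed are hypothetical choices, and it assumes that a Method II result can be shown in both plot modes, as the description of the two modes suggests.

```r
library(extremevalues)

set.seed(42)                           # arbitrary seed for reproducibility
y <- rlnorm(100)
y <- c(0.1 * min(y), y, 10 * max(y))
L <- getOutliers(y, method = "II", distribution = "lognormal")

out_file <- file.path(tempdir(), "outliers.pdf")   # hypothetical output path
pdf(out_file, width = 6, height = 6)               # one plot per page
# title and fat are forwarded to qqFitPlot / plotMethodII via "..."
outlierPlot(y, L, mode = "qq", title = "QQ fit", fat = TRUE)
outlierPlot(y, L, mode = "residual", title = "Fit residuals", fat = TRUE)
dev.off()
```

Setting fat=TRUE is intended for publication-size output, so it is mainly useful when writing to a file device like this rather than when exploring interactively.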
Density, quantile function and random generator for the Pareto distribution.
dpareto(x, xm=1, alpha=1)
qpareto(p, xm=1, alpha=1)
rpareto(n, xm=1, alpha=1)
xm | location parameter (mode of distribution) |
alpha | spread parameter |
x | Vector of realizations |
p | Vector of probabilities |
n | number of samples to draw |
dpareto | Probability density |
qpareto | Quantile at probability p (inverse cdf) |
rpareto | Random value |
Mark van der Loo www.markvanderloo.eu
q <- qpareto(0.5);
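To make the roles of xm and alpha concrete, the three functions can be mirrored in base R. This is a sketch assuming the standard Pareto (type I) parametrization, with density alpha * xm^alpha / x^(alpha + 1) for x >= xm; the `*_ref` names are hypothetical, and the package's own functions may differ in details such as input checking.

```r
# Hypothetical reference functions for the Pareto (type I) distribution.
pareto_density_ref <- function(x, xm = 1, alpha = 1) {
  ifelse(x >= xm, alpha * xm^alpha / x^(alpha + 1), 0)
}
pareto_quantile_ref <- function(p, xm = 1, alpha = 1) {
  xm * (1 - p)^(-1 / alpha)                  # inverse of F(x) = 1 - (xm/x)^alpha
}
pareto_random_ref <- function(n, xm = 1, alpha = 1) {
  pareto_quantile_ref(runif(n), xm, alpha)   # inverse-transform sampling
}

median_ref <- pareto_quantile_ref(0.5)       # 2 under this parametrization
total_prob <- integrate(pareto_density_ref, lower = 1, upper = Inf)$value
draws      <- pareto_random_ref(10)          # all draws are >= xm
```

Drawing random values through the quantile function keeps the three helpers consistent with each other by construction.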